A Scalable Classification Algorithm for Very Large Datasets

Cited by: 0
Authors
Delen, Dursun [1]
Kletke, Marilyn [1]
Kim, Jin-Hwa [2]
Affiliations
[1] Oklahoma State Univ, Spears Sch Business, Dept Management Sci & Informat Syst, Stillwater, OK 74078 USA
[2] Sogang Univ, Sch Business, Seoul, South Korea
Keywords
Massive datasets; data mining; rule induction; classification; knowledge bases; refinement techniques
DOI
10.1142/S0219649205001092
Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501
Abstract
Today's organisations collect and store massive amounts of data from their customer transactions and e-commerce/e-business applications. Many classification algorithms do not scale well enough to work effectively and efficiently with these very large datasets. This study constructs a new scalable classification algorithm (referred to in this manuscript as the Iterative Refinement Algorithm, or IRA for short) that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds the initial domain knowledge from a subset of the available data and then iteratively improves, sharpens and polishes it using chunks of the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem and another with approximately 600K records for a seven-class classification problem) yielded more accurate domain knowledge than other prediction methods, including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms, whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.
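The general pattern the abstract describes (build an initial model from a first subset, then refine it with successive chunks of the remaining data) can be illustrated with a minimal Python sketch. This is not the authors' IRA, whose rule-induction and refinement steps are not given in this record. The `CountingClassifier` class, the `chunks` and `fit_iteratively` helpers, and the toy records are all hypothetical stand-ins, chosen only because a count-based model can be updated in place without revisiting earlier data.

```python
from collections import Counter, defaultdict
from typing import Iterator, Sequence, Tuple

# A record is (categorical feature values, class label). Hypothetical format.
Record = Tuple[Tuple[str, ...], str]


def chunks(records: Sequence[Record], size: int) -> Iterator[Sequence[Record]]:
    """Yield consecutive slices of `records` holding at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


class CountingClassifier:
    """Illustrative stand-in model (not IRA): label counts per feature value."""

    def __init__(self) -> None:
        # votes[(feature position, feature value)][label] -> times observed
        self.votes = defaultdict(Counter)
        self.class_counts = Counter()

    def update(self, chunk: Sequence[Record]) -> None:
        """Fold one chunk into the existing counts without a full rebuild."""
        for features, label in chunk:
            self.class_counts[label] += 1
            for position, value in enumerate(features):
                self.votes[(position, value)][label] += 1

    def predict(self, features: Tuple[str, ...]) -> str:
        """Return the label with the most accumulated votes for these values."""
        tally = Counter()
        for position, value in enumerate(features):
            tally.update(self.votes.get((position, value), Counter()))
        if not tally:
            # No feature value was seen before: fall back to the majority class.
            return self.class_counts.most_common(1)[0][0]
        return tally.most_common(1)[0][0]


def fit_iteratively(records: Sequence[Record], chunk_size: int) -> CountingClassifier:
    """The first chunk seeds the model; every later chunk refines it."""
    model = CountingClassifier()
    for chunk in chunks(records, chunk_size):
        model.update(chunk)
    return model


# Toy data (invented for illustration): the label follows the second feature.
data = [
    (("sunny", "hot"), "no"),
    (("rainy", "mild"), "yes"),
    (("sunny", "mild"), "yes"),
    (("rainy", "hot"), "no"),
    (("sunny", "hot"), "no"),
    (("rainy", "mild"), "yes"),
]
model = fit_iteratively(data, chunk_size=2)
print(model.predict(("sunny", "mild")))
```

Because each chunk is processed once and then discarded, memory use depends on the chunk size and the number of distinct feature values rather than on the total number of records, which is the property that makes chunk-wise refinement attractive for very large datasets.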
Pages: 83-94
Page count: 12
Related papers
50 in total
  • [21] A scalable association rule learning and recommendation algorithm for large-scale microarray datasets
    Li, Haosong
    Sheu, Phillip C-Y
    JOURNAL OF BIG DATA, 2022, 9 (01)
  • [22] Multidimensional Scaling With Very Large Datasets
    Paradis, Emmanuel
    JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2018, 27 (04) : 935 - 939
  • [23] Analysis of very large voxel datasets
    Gorte, Ben
    INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2023, 119
  • [24] Clustering of very large datasets.
    Downs, GM
    Barnard, JM
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2001, 222 : U396 - U396
  • [25] The Forward Search for Very Large Datasets
    Riani, Marco
    Perrotta, Domenico
    Cerioli, Andrea
    JOURNAL OF STATISTICAL SOFTWARE, 2015, 67 (CS1)
  • [26] Scalable reduction of large datasets to interesting subsets
    Williams, Gregory Todd
    Weaver, Jesse
    Atre, Medha
    Hendler, James A.
    JOURNAL OF WEB SEMANTICS, 2010, 8 (04) : 365 - 373
  • [27] Scalable TSK Fuzzy Modeling for Very Large Datasets Using Minimal-Enclosing-Ball Approximation
    Deng, Zhaohong
    Choi, Kup-Sze
    Chung, Fu-Lai
    Wang, Shitong
    IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2011, 19 (02) : 210 - 226
  • [28] Scalable and portable visualization of large atomistic datasets
    Sharma, A
    Kalia, RK
    Nakano, A
    Vashishta, P
    COMPUTER PHYSICS COMMUNICATIONS, 2004, 163 (01) : 53 - 64
  • [29] Scalable Distributed Data Anonymization for Large Datasets
    di Vimercati, Sabrina De Capitani
    Facchinetti, Dario
    Foresti, Sara
    Livraga, Giovanni
    Oldani, Gianluca
    Paraboschi, Stefano
    Rossi, Matthew
    Samarati, Pierangela
    IEEE TRANSACTIONS ON BIG DATA, 2023, 9 (03) : 818 - 831
  • [30] Scalable grid-based clustering algorithm for very large spatial databases
    Sun, Yufen
    Lu, Yansheng
    2006 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY, PTS 1 AND 2, PROCEEDINGS, 2006 : 763 - 768