A Scalable Classification Algorithm for Very Large Datasets

Times cited: 0

Authors:

Delen, Dursun [1]

Kletke, Marilyn [1]

Kim, Jin-Hwa [2]

Affiliations:

[1] Oklahoma State Univ, Spears Sch Business, Dept Management Sci & Informat Syst, Stillwater, OK 74078 USA

[2] Sogang Univ, Sch Business, Seoul, South Korea

Source:

JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT | 2005 / Vol. 4 / Issue 02

Keywords:

Massive datasets; data mining; rule induction; classification; knowledge bases; refinement techniques

DOI:

10.1142/S0219649205001092

Chinese Library Classification (CLC) number:

G25 [Library science, librarianship]; G35 [Information science, information work]

Discipline classification code:

1205; 120501

Abstract:

Today's organisations are collecting and storing massive amounts of data from their customer transactions and e-commerce/e-business applications. Many classification algorithms are not scalable to work effectively and efficiently with these very large datasets. This study constructs a new scalable classification algorithm (referred to in this manuscript as Iterative Refinement Algorithm, or IRA for short) that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds the initial domain knowledge from a subset of the available data and then iteratively improves, sharpens and polishes it using the chunks from the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem and another with approximately 600K records for a seven-class classification problem) resulted in more accurate domain knowledge as compared to other prediction methods including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.
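
The abstract does not spell out how IRA represents or refines its rules, so the following Python sketch is not the authors' algorithm. It only illustrates the general pattern described above (build initial knowledge from a first subset, then refine it chunk by chunk) using a deliberately simple, hypothetical count-based rule table with one categorical feature. The names `ChunkedRuleLearner`, `refine`, `predict` and `chunks` are assumptions made for this illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple

Record = Tuple[Hashable, str]  # (feature value, class label)


class ChunkedRuleLearner:
    """Toy rule table refined one chunk at a time (illustrative, not IRA)."""

    def __init__(self) -> None:
        # Per feature value: how often each class label was observed.
        self.counts: Dict[Hashable, Counter] = defaultdict(Counter)
        # Overall label frequencies, used as a fallback for unseen values.
        self.class_counts: Counter = Counter()

    def refine(self, chunk: Sequence[Record]) -> None:
        """Update the rule table with one chunk; earlier chunks are not re-read."""
        for value, label in chunk:
            self.counts[value][label] += 1
            self.class_counts[label] += 1

    def predict(self, value: Hashable) -> str:
        """Return the majority label for a known value, else the overall majority."""
        if value in self.counts:
            return self.counts[value].most_common(1)[0][0]
        return self.class_counts.most_common(1)[0][0]


def chunks(records: List[Record], size: int) -> Iterator[List[Record]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]
```

In this sketch the first call to `refine` plays the role of building the initial knowledge, and each later call revises the rules with new evidence. Memory use depends on the number of distinct feature values rather than on the number of records, which is the property that makes chunk-wise processing attractive for very large datasets.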


Pages: 83-94

Page count: 12

