A Scalable Classification Algorithm for Very Large Datasets

Cited by: 0

Authors:

Delen, Dursun [1]

Kletke, Marilyn [1]

Kim, Jin-Hwa [2]

Affiliations:

[1] Oklahoma State Univ, Spears Sch Business, Dept Management Sci & Informat Syst, Stillwater, OK 74078 USA

[2] Sogang Univ, Sch Business, Seoul, South Korea

Source:

JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT | 2005 / Vol. 4 / No. 02

Keywords:

Massive datasets; data mining; rule induction; classification; knowledge bases; refinement techniques

DOI:

10.1142/S0219649205001092

Chinese Library Classification (CLC) number:

G25 [Library science, librarianship]; G35 [Information science, information work]

Discipline classification code:

1205; 120501

Abstract:

Today's organisations are collecting and storing massive amounts of data from their customer transactions and e-commerce/e-business applications. Many classification algorithms do not scale to work effectively and efficiently with these very large datasets. This study constructs a new scalable classification algorithm (referred to in this manuscript as the Iterative Refinement Algorithm, or IRA for short) that builds domain knowledge from very large datasets using an iterative inductive learning mechanism. Unlike existing algorithms that build the complete domain knowledge from a dataset all at once, IRA builds the initial domain knowledge from a subset of the available data and then iteratively improves, sharpens and polishes it using chunks of the remaining data. Performance testing of IRA on two datasets (one with approximately five million records for a binary classification problem and another with approximately 600K records for a seven-class classification problem) yielded more accurate domain knowledge than other prediction methods, including logistic regression, discriminant analysis, neural networks, C5, CART and CHAID. Unlike other classification algorithms, whose performance and accuracy deteriorate as data size increases, the efficacy of IRA improves as datasets become significantly larger.
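The abstract does not specify how IRA represents or refines its rules, so the following is only a minimal sketch of the general pattern it describes: build an initial knowledge base from the first chunk of data, then refine it chunk by chunk. The `RuleBase` class, its one-rule-per-attribute-value voting scheme, and the `chunks` and `build_iteratively` helpers are all hypothetical illustrations and are not the authors' method.

```python
from collections import Counter, defaultdict
from itertools import islice


def chunks(records, size):
    """Yield successive lists of at most `size` records from an iterable."""
    it = iter(records)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


class RuleBase:
    """Hypothetical knowledge base: one rule per (attribute, value) pair."""

    def __init__(self):
        # (attribute index, value) -> Counter of class labels seen with it
        self.counts = defaultdict(Counter)
        self.class_totals = Counter()

    def refine(self, chunk):
        """Update rule statistics with one chunk of (features, label) pairs."""
        for features, label in chunk:
            self.class_totals[label] += 1
            for i, value in enumerate(features):
                self.counts[(i, value)][label] += 1

    def predict(self, features):
        """Vote with the majority class of each matching rule."""
        votes = Counter()
        for i, value in enumerate(features):
            rule = self.counts.get((i, value))
            if rule:
                votes[rule.most_common(1)[0][0]] += 1
        if votes:
            return votes.most_common(1)[0][0]
        # Fall back to the overall majority class when no rule matches.
        return self.class_totals.most_common(1)[0][0]


def build_iteratively(records, chunk_size):
    """Build from the first chunk, then refine with each remaining chunk."""
    kb = RuleBase()
    for chunk in chunks(records, chunk_size):
        kb.refine(chunk)
    return kb
```

Because `chunks` consumes any iterable lazily, only one chunk needs to be in memory at a time, which is the property that makes this style of learner attractive for very large datasets. A real refinement step would presumably also prune or specialise rules rather than only update counts.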


Pages: 83-94

Page count: 12
