Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

Cited by: 12

Authors:

Ordonez, Carlos [1]

Chen, Zhibo [1]

Affiliations:

[1] Univ Houston, Dept Comp Sci, Houston, TX 77204 USA

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2012 / Vol. 24 / No. 04

Funding:

National Science Foundation (USA);

Keywords:

Aggregation; data preparation; pivoting; SQL;

DOI:

10.1109/TKDE.2011.16

Chinese Library Classification:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations are of limited use for preparing data sets because they return one column per aggregated group. In general, significant manual effort is required to build data sets that need a horizontal layout. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE, which exploits the programming CASE construct; SPJ, which is based on standard relational algebra operators (SPJ queries); and PIVOT, which uses the PIVOT operator offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and it is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.
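
To make the CASE idea in the abstract concrete, the following is a minimal sketch, not the authors' generated code. It builds a query with one SUM(CASE ...) column per distinct value of a pivoting column and runs it in an in-memory SQLite database. The function name `horizontal_sum_case`, the `sales` table, and its columns are hypothetical and chosen only for illustration; SQLite is assumed simply because it ships with Python.

```python
import sqlite3


def horizontal_sum_case(conn, table, group_col, pivot_col, value_col):
    """Run a CASE-based horizontal aggregation (illustrative sketch).

    Returns (column_names, rows) with one row per group_col value and
    one summed column per distinct pivot_col value.
    Identifiers are interpolated into the SQL text, so this is only
    suitable for trusted table and column names.
    """
    cur = conn.cursor()
    # Step 1: find the distinct pivot values; each becomes an output column.
    pivot_values = [
        row[0]
        for row in cur.execute(
            f"SELECT DISTINCT {pivot_col} FROM {table} ORDER BY {pivot_col}"
        )
    ]
    # Step 2: one SUM(CASE ...) term per pivot value. Values are bound as
    # parameters; rows that do not match contribute NULL to the sum.
    case_terms = ", ".join(
        f"SUM(CASE WHEN {pivot_col} = ? THEN {value_col} ELSE NULL END) "
        f'AS "{pivot_col}_{value}"'
        for value in pivot_values
    )
    sql = (
        f"SELECT {group_col}, {case_terms} FROM {table} "
        f"GROUP BY {group_col} ORDER BY {group_col}"
    )
    cur.execute(sql, pivot_values)
    columns = [desc[0] for desc in cur.description]
    return columns, cur.fetchall()


# Hypothetical vertical (one row per observation) input table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store INTEGER, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "Mon", 10.0), (1, "Mon", 5.0), (1, "Tue", 7.0),
     (2, "Mon", 3.0), (2, "Wed", 4.0)],
)

columns, rows = horizontal_sum_case(conn, "sales", "store", "day", "amount")
print(columns)
for row in rows:
    print(row)
```

Each store ends up as a single row with one column per day, which is the horizontal layout the abstract describes. Combinations with no input rows come back as NULL (None in Python) rather than 0 in this sketch; whether that is the desired convention is a design choice left open here.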

Citation

Pages: 678-691

Page count: 14
