Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

Cited by: 12
Authors
Ordonez, Carlos [1]
Chen, Zhibo [1]
Affiliations
[1] Univ Houston, Dept Comp Sci, Houston, TX 77204 USA
Funding
National Science Foundation (USA);
Keywords
Aggregation; data preparation; pivoting; SQL;
DOI
10.1109/TKDE.2011.16
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations are of limited use for preparing data sets because they return one column per aggregated group. In general, significant manual effort is required to build data sets where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE, exploiting the programming CASE construct; SPJ, based on standard relational algebra operators (SPJ queries); and PIVOT, using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.
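The CASE and SPJ evaluation strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' code generator: the `sales` table, its columns (`store_id`, `dow`, `amount`), and the helper names `case_query` and `spj_query` are hypothetical, and SQLite is used only because it ships with Python. The PIVOT variant is omitted because SQLite has no PIVOT operator. The sketch assumes the distinct values of the transposed column are trusted, simple identifiers, since they are spliced directly into the generated SQL.

```python
import sqlite3

# Hypothetical vertical fact table: one row per sale.
rows = [
    (1, "Mon", 10.0),
    (1, "Tue", 5.0),
    (1, "Mon", 2.5),
    (2, "Tue", 7.0),
    (2, "Wed", 1.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store_id INTEGER, dow TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Distinct values of the column to transpose; each becomes one output column.
dows = [r[0] for r in conn.execute("SELECT DISTINCT dow FROM sales ORDER BY dow")]


def case_query(values):
    """CASE strategy: one pass over the table, one SUM(CASE ...) per value."""
    cols = ", ".join(
        f"SUM(CASE WHEN dow = '{v}' THEN amount ELSE NULL END) AS amount_{v}"
        for v in values
    )
    return f"SELECT store_id, {cols} FROM sales GROUP BY store_id ORDER BY store_id"


def spj_query(values):
    """SPJ strategy: one aggregated subquery per value, left-joined on the key."""
    select = ", ".join(f"f{i}.a AS amount_{v}" for i, v in enumerate(values))
    joins = " ".join(
        f"LEFT JOIN (SELECT store_id, SUM(amount) AS a FROM sales "
        f"WHERE dow = '{v}' GROUP BY store_id) AS f{i} "
        f"ON f{i}.store_id = k.store_id"
        for i, v in enumerate(values)
    )
    return (
        f"SELECT k.store_id, {select} "
        f"FROM (SELECT DISTINCT store_id FROM sales) AS k {joins} "
        f"ORDER BY k.store_id"
    )


case_result = conn.execute(case_query(dows)).fetchall()
spj_result = conn.execute(spj_query(dows)).fetchall()

# Each store is now a single row with one aggregated column per day value.
for row in case_result:
    print(row)
```

In this sketch both strategies return one row per `store_id`, with NULL where a store has no rows for a given day. The SPJ form needs one subquery and one join per distinct value, while the CASE form scans the table once.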
Pages: 678-691
Page count: 14