Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

Cited by: 12
Authors
Ordonez, Carlos [1]
Chen, Zhibo [1]
Affiliations
[1] Univ Houston, Dept Comp Sci, Houston, TX 77204 USA
Funding
National Science Foundation (USA)
Keywords
Aggregation; data preparation; pivoting; SQL
DOI
10.1109/TKDE.2011.16
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Preparing a data set for analysis is generally the most time consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations to prepare data sets because they return one column per aggregated group. In general, a significant manual effort is required to build data sets, where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE: Exploiting the programming CASE construct; SPJ: Based on standard relational algebra operators (SPJ queries); PIVOT: Using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and it is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.
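The CASE and SPJ methods named in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical table `sales(store_id, dow, amount)` held in an in-memory SQLite database; the table, its columns, the sample rows, and the helper names are illustrative and do not come from the paper. The code generates one aggregated column per distinct `dow` value, first with `SUM(CASE ...)` and then with one left outer join per value. The PIVOT method is omitted because SQLite does not offer a PIVOT operator.

```python
import sqlite3


def quote_ident(name):
    """Quote a value so it can serve as a SQL column alias."""
    return '"' + str(name).replace('"', '""') + '"'


def make_demo_db():
    """Build a tiny in-memory table in vertical layout (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store_id INTEGER, dow TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [(1, "Mon", 10.0), (1, "Tue", 5.0), (1, "Mon", 2.5),
         (2, "Tue", 7.0), (2, "Wed", 1.0)],
    )
    return conn


def distinct_dows(conn):
    """Distinct values of the transposing column; each becomes one output column."""
    return [r[0] for r in conn.execute("SELECT DISTINCT dow FROM sales ORDER BY dow")]


def horizontal_sum_case(conn):
    """CASE-style evaluation: a single scan with one SUM(CASE ...) per value."""
    values = distinct_dows(conn)
    cols = ", ".join(
        f"SUM(CASE WHEN dow = ? THEN amount END) AS {quote_ident(v)}"
        for v in values
    )
    sql = f"SELECT store_id, {cols} FROM sales GROUP BY store_id ORDER BY store_id"
    return conn.execute(sql, values).fetchall()


def horizontal_sum_spj(conn):
    """SPJ-style evaluation: one aggregated subquery per value, left-joined to the keys."""
    values = distinct_dows(conn)
    select_cols = ", ".join(
        f"f{i}.total AS {quote_ident(v)}" for i, v in enumerate(values, 1)
    )
    joins = " ".join(
        f"LEFT JOIN (SELECT store_id, SUM(amount) AS total FROM sales "
        f"WHERE dow = ? GROUP BY store_id) AS f{i} ON f{i}.store_id = f0.store_id"
        for i in range(1, len(values) + 1)
    )
    sql = (
        f"SELECT f0.store_id, {select_cols} "
        f"FROM (SELECT DISTINCT store_id FROM sales) AS f0 {joins} "
        f"ORDER BY f0.store_id"
    )
    return conn.execute(sql, values).fetchall()


conn = make_demo_db()
case_rows = horizontal_sum_case(conn)
spj_rows = horizontal_sum_spj(conn)
print(case_rows)  # one row per store, one column per day of week
print(spj_rows)   # same layout, produced by joins instead of CASE
```

Both functions return one row per `store_id` with a column for each day of the week, and a group with no matching rows yields NULL (`None` in Python). The SPJ variant scans the table once per output column, which is consistent with the abstract's report that SPJ is slower than CASE, although this toy example makes no performance claim of its own.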
Pages: 678-691
Page count: 14
Related papers
50 in total
  • [11] ATLaS: A native extension of SQL for data mining
    Wang, HX
    Zaniolo, C
    PROCEEDINGS OF THE THIRD SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2003, : 130 - 141
  • [12] Visual data mining of large spatial data sets
    Keim, DA
    Panse, C
    Sips, M
    DATABASES IN NETWORKED INFORMATION SYSTEMS, PROCEEDINGS, 2003, 2822 : 201 - 215
  • [13] From visualisation to data mining with large data sets
    Adelmann, A
    Ryne, RD
    Shalf, JM
    Siegerist, C
    2005 IEEE PARTICLE ACCELERATOR CONFERENCE (PAC), VOLS 1-4, 2005, : 542 - 544
  • [14] Massive data sets, data mining, and decision support
    Dalal, S
    Dumais, S
    Kettenring, J
    Kurien, V
    McIntosh, A
    Maitra, R
    MINING AND MODELING MASSIVE DATA SETS IN SCIENCE, ENGINEERING, AND BUSINESS WITH A SUBTHEME IN ENVIRONMENTAL STATISTICS, 1997, 29 (01): : 329 - 329
  • [15] Data mining from extreme data sets: Very large and/or very skewed data sets
    Hall, LO
    2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: E-SYSTEMS AND E-MAN FOR CYBERNETICS IN CYBERSPACE, 2002, : 2555 - 2555
  • [16] Supporting SQL-3 aggregations on grid-based data repositories
    Weng, L
    Agrawal, G
    Catalyurek, U
    Saltz, J
    LANGUAGES AND COMPILERS FOR HIGH PERFORMANCE COMPUTING, 2005, 3602 : 283 - 298
  • [17] Analysis of user behaviors by mining large network data sets
    Wang, Zhenhua
    Tu, Lai
    Guo, Zhe
    Yang, Laurence T.
    Huang, Benxiong
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2014, 37 : 429 - 437
  • [18] Advanced Studying on Microsoft SQL Server Data Mining
    Ren, Zhijun
    2010 INTERNATIONAL CONFERENCE ON INNOVATIVE COMPUTING AND COMMUNICATION AND 2010 ASIA-PACIFIC CONFERENCE ON INFORMATION TECHNOLOGY AND OCEAN ENGINEERING: CICC-ITOE 2010, PROCEEDINGS, 2010, : 87 - 89
  • [19] On NIS-Apriori Based Data Mining in SQL
    Sakai, Hiroshi
    Liu, Chenxi
    Zhu, Xiaoxin
    Nakata, Michinori
    ROUGH SETS, (IJCRS 2016), 2016, 9920 : 514 - 524
  • [20] Building Data Mining Applications with SQL Server 2005
    Wang, Dongyun
    Ren, Zhijun
    2008 4TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-31, 2008, : 10859 - 10862