Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

被引:12
|
作者
Ordonez, Carlos [1 ]
Chen, Zhibo [1 ]
机构
[1] Univ Houston, Dept Comp Sci, Houston, TX 77204 USA
基金
美国国家科学基金会;
关键词
Aggregation; data preparation; pivoting; SQL;
D O I
10.1109/TKDE.2011.16
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Preparing a data set for analysis is generally the most time consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations to prepare data sets because they return one column per aggregated group. In general, a significant manual effort is required to build data sets, where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE: Exploiting the programming CASE construct; SPJ: Based on standard relational algebra operators (SPJ queries); PIVOT: Using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and it is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.
引用
收藏
页码:678 / 691
页数:14
相关论文
共 50 条
  • [31] New features of categorical principal components analysis for complicated data sets, including data mining
    Meulman, JJ
    van der Kooij, AJ
    Babinec, A
    CLASSIFICATION, AUTOMATION, AND NEW MEDIA, 2002, : 207 - 217
  • [32] Analysis of complex real-time atmospheric data sets: A data mining approach.
    Gross, DS
    Yuen, M
    Ault, AP
    Musicant, DR
    Anderson, BJ
    Ritz, A
    Schauer, JJ
    Ramakrishnan, R
    Chen, L
    Chen, BC
    ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2005, 229 : U127 - U127
  • [33] Studying on integrating Custom Data Mining Algorithms in SQL Server 2005 Analysis Services
    Wang, Dongyun
    EBM 2010: INTERNATIONAL CONFERENCE ON ENGINEERING AND BUSINESS MANAGEMENT, VOLS 1-8, 2010, : 5243 - 5246
  • [34] Data mining mining data - Ordered vector quantisation and examples of its application to mine geotechnical data sets
    Fraser, S. J.
    Mikula, P. A.
    Lee, M. F.
    Dickson, B. L.
    Kinnersly, E.
    6TH INTERNATIONAL MINING GEOLOGY CONFERENCE, 2006, : 259 - +
  • [35] Developing a Data Mining Model for Predicting with SQL Server 2008
    Wang, Dongyun
    2010 INTERNATIONAL CONFERENCE ON INNOVATIVE COMPUTING AND COMMUNICATION AND 2010 ASIA-PACIFIC CONFERENCE ON INFORMATION TECHNOLOGY AND OCEAN ENGINEERING: CICC-ITOE 2010, PROCEEDINGS, 2010, : 83 - 86
  • [36] Implementing data mining algorithms with Microsoft SQL server.
    Curotto, CL
    Ebecken, NFF
    DATA MINING III, 2002, 6 : 73 - 82
  • [37] Structured Scheme of Data Mining based on SQL Server 2005
    Han, Xiaohong
    Wang, Lei
    Zhang, Peijun
    MATERIALS SCIENCE AND ENGINEERING, PTS 1-2, 2011, 179-180 : 646 - +
  • [38] Advanced Studying on Microsoft SQL Server 2008 Data Mining
    Wang, Dongyun
    INFORMATION TECHNOLOGY FOR MANUFACTURING SYSTEMS, PTS 1 AND 2, 2010, : 748 - 752
  • [39] Multi-relational data mining in Microsoft® SQL Server™
    Curotto, CL
    Ebecken, NFF
    DATA MINING VI: DATA MINING, TEXT MINING AND THEIR BUSINESS APPLICATIONS, 2005, : 219 - 228
  • [40] Mining internet data sets for computational grids
    Borzemski, L
    KNOWLEDGE-BASED INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, PT 3, PROCEEDINGS, 2005, 3683 : 268 - 274