Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

Cited by: 12
Authors
Ordonez, Carlos [1]
Chen, Zhibo [1]
Affiliations
[1] Univ Houston, Dept Comp Sci, Houston, TX 77204 USA
Funding
National Science Foundation (USA);
Keywords
Aggregation; data preparation; pivoting; SQL;
DOI
10.1109/TKDE.2011.16
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations to prepare data sets because they return one column per aggregated group. In general, a significant manual effort is required to build data sets, where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE: Exploiting the programming CASE construct; SPJ: Based on standard relational algebra operators (SPJ queries); PIVOT: Using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and it is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.
Pages: 678-691
Page count: 14
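The CASE method named in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' code: the `sales` table, its columns, and the helper `horizontal_sum_sql` are hypothetical names chosen for illustration, and SQLite (via Python's standard `sqlite3` module) is assumed only because it runs without a server. The sketch first computes an ordinary vertical aggregation (one row per group), then generates a query that returns the same sums in a horizontal layout, with one column per distinct value of the pivoting column.

```python
import sqlite3


def horizontal_sum_sql(table, group_col, pivot_col, value_col, pivot_values):
    """Generate a CASE-based query with one SUM column per pivot value.

    Hypothetical helper for illustration. Identifiers and values are
    interpolated directly, so it must only be used with trusted input.
    """
    cases = ",\n  ".join(
        f"SUM(CASE WHEN {pivot_col} = '{v}' THEN {value_col} ELSE NULL END) "
        f"AS {value_col}_{v}"
        for v in pivot_values
    )
    return (
        f"SELECT {group_col},\n  {cases}\n"
        f"FROM {table}\nGROUP BY {group_col}\nORDER BY {group_col}"
    )


# Toy fact table (assumed schema): one row per individual sale.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store INTEGER, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "Mon", 10.0), (1, "Mon", 5.0), (1, "Tue", 7.0),
        (2, "Mon", 3.0), (2, "Wed", 8.0),
    ],
)

# Vertical aggregation: one row per (store, day) group.
vertical = conn.execute(
    "SELECT store, day, SUM(amount) FROM sales "
    "GROUP BY store, day ORDER BY store, day"
).fetchall()

# Discover the distinct pivot values, then generate the horizontal query.
days = [row[0] for row in conn.execute(
    "SELECT DISTINCT day FROM sales ORDER BY day")]
query = horizontal_sum_sql("sales", "store", "day", "amount", days)

# Horizontal aggregation: one row per store, one column per day.
horizontal = conn.execute(query).fetchall()

print(query)
print(vertical)
print(horizontal)
```

In this sketch, a store with no sales on a given day gets NULL (Python `None`) in that column, because `SUM` over only NULL inputs yields NULL in SQLite. The result has one row per store, which is the observation-variable layout the abstract describes.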