Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

Cited by: 12
Authors
Ordonez, Carlos [1]
Chen, Zhibo [1]
Affiliations
[1] Univ Houston, Dept Comp Sci, Houston, TX 77204 USA
Funding
National Science Foundation (USA)
Keywords
Aggregation; data preparation; pivoting; SQL
DOI
10.1109/TKDE.2011.16
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Preparing a data set for analysis is generally the most time-consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations to prepare data sets because they return one column per aggregated group. In general, a significant manual effort is required to build data sets where a horizontal layout is required. We propose simple, yet powerful, methods to generate SQL code to return aggregated columns in a horizontal tabular layout, returning a set of numbers instead of one number per row. This new class of functions is called horizontal aggregations. Horizontal aggregations build data sets with a horizontal denormalized layout (e.g., point-dimension, observation-variable, instance-feature), which is the standard layout required by most data mining algorithms. We propose three fundamental methods to evaluate horizontal aggregations: CASE: Exploiting the programming CASE construct; SPJ: Based on standard relational algebra operators (SPJ queries); PIVOT: Using the PIVOT operator, which is offered by some DBMSs. Experiments with large tables compare the proposed query evaluation methods. Our CASE method has similar speed to the PIVOT operator and it is much faster than the SPJ method. In general, the CASE and PIVOT methods exhibit linear scalability, whereas the SPJ method does not.
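To make the CASE and SPJ ideas named in the abstract concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. It is not the authors' code generator: the `sales` table, its columns (`store_id`, `dweek`, `amt`), and the sample rows are hypothetical, and the sketch assumes the distinct `dweek` values are simple strings that are safe to embed in column aliases. The PIVOT method is omitted because SQLite has no PIVOT operator.

```python
import sqlite3

# Hypothetical vertical fact table: one row per sale (illustrative data only).
rows = [
    (1, "Mon", 10.0),
    (1, "Tue", 5.0),
    (1, "Mon", 2.5),
    (2, "Tue", 7.0),
    (2, "Wed", 1.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store_id INTEGER, dweek TEXT, amt REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Step 1: the distinct values of the pivoting column become output columns.
values = [r[0] for r in conn.execute(
    "SELECT DISTINCT dweek FROM sales ORDER BY dweek")]

# CASE-style evaluation: one SUM(CASE ...) term per distinct value,
# computed in a single GROUP BY pass over the table.
# Assumption: values are plain identifiers, so they are safe inside aliases.
case_terms = ", ".join(
    f'SUM(CASE WHEN dweek = ? THEN amt END) AS "amt_{v}"' for v in values)
case_query = (
    f"SELECT store_id, {case_terms} FROM sales "
    "GROUP BY store_id ORDER BY store_id")
case_result = conn.execute(case_query, values).fetchall()

# SPJ-style evaluation: one aggregated projection per distinct value,
# each left-joined to the table of distinct grouping keys.
select_cols = ", ".join(f"f{i}.a" for i in range(1, len(values) + 1))
joins = " ".join(
    f"LEFT JOIN (SELECT store_id, SUM(amt) AS a FROM sales "
    f"WHERE dweek = ? GROUP BY store_id) f{i} "
    f"ON f{i}.store_id = f0.store_id"
    for i in range(1, len(values) + 1))
spj_query = (
    f"SELECT f0.store_id, {select_cols} "
    f"FROM (SELECT DISTINCT store_id FROM sales) f0 {joins} "
    "ORDER BY f0.store_id")
spj_result = conn.execute(spj_query, values).fetchall()

print(case_result)
print(spj_result)
```

Both queries return one row per `store_id` with one aggregated column per distinct `dweek` value, with NULL where a store has no rows for that value. The CASE form scans the table once, while the SPJ form builds and joins one subquery per output column, which is consistent with the abstract's observation that SPJ scales worse.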
Pages: 678-691
Number of pages: 14