DAFEE: A Scalable Distributed Automatic Feature Engineering Algorithm for Relational Datasets

Authors
Zhao, Wenqian [1]
Li, Xiangxiang [1]
Rong, Guoping [2]
Lin, Mufeng [1]
Lin, Chen [1]
Yang, Yifan [1]
Affiliations
[1] Transwarp Technology (Shanghai) Co., Ltd., Shanghai, People's Republic of China
[2] Nanjing University, Joint Laboratory of Nanjing University and Transwarp Data Technology, Nanjing, People's Republic of China
Keywords
AutoML; Automatic feature engineering; Relational dataset; Big data; Feature selection; Machine learning; Classification
DOI
10.1007/978-3-030-60239-0_3
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Automatic feature engineering aims to construct informative features automatically and to reduce manual labor in machine learning applications. Most existing approaches are designed for tasks with a single data source, which makes them less applicable to real-world scenarios. In this paper, we present DAFEE, a distributed automatic feature engineering algorithm that generates features across multiple large-scale relational datasets. Starting from the target table, the algorithm uses a breadth-first-search-style traversal to find related tables and constructs advanced high-order features that are remarkably effective in practical applications. Moreover, DAFEE implements a feature selection method to reduce computational cost and improve predictive performance. Furthermore, it is highly optimized to process massive volumes of data. Experimental results demonstrate that it improves predictive performance by 7% compared to state-of-the-art (SOTA) algorithms.
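The table discovery step mentioned in the abstract can be illustrated with a minimal sketch of a breadth-first traversal over table relationships. This is not the DAFEE implementation: the abstract does not specify how relationships are stored or how deep the search goes, so the function name `bfs_related_tables`, the `max_depth` parameter, and the example schema are all hypothetical.

```python
from collections import deque


def bfs_related_tables(relations, target, max_depth=2):
    """Return (table, depth) pairs reachable from `target`, in BFS order.

    `relations` maps a table name to the tables it is linked to
    (assumed here to be foreign-key links; an illustrative assumption).
    """
    seen = {target}
    order = []
    queue = deque([(target, 0)])
    while queue:
        table, depth = queue.popleft()
        order.append((table, depth))
        if depth == max_depth:
            # Do not expand tables beyond the depth limit.
            continue
        for neighbor in relations.get(table, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return order


# Hypothetical schema, not taken from the paper.
relations = {
    "customers": ["orders", "tickets"],
    "orders": ["order_items"],
    "order_items": ["products"],
    "tickets": [],
}

result = bfs_related_tables(relations, "customers", max_depth=2)
for table, depth in result:
    print(depth, table)
```

In this sketch the depth of a table stands in for the order of features that could be built from it: tables at depth 1 would feed first-order aggregates onto the target table, and deeper tables would feed higher-order ones. How DAFEE actually constructs and selects those features is not described in the abstract.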
Pages: 32-46
Page count: 15