DAFEE: A Scalable Distributed Automatic Feature Engineering Algorithm for Relational Datasets

Cited by: 0
Authors
Zhao, Wenqian [1]
Li, Xiangxiang [1]
Rong, Guoping [2]
Lin, Mufeng [1]
Lin, Chen [1]
Yang, Yifan [1]
Affiliations
[1] Transwarp Technol Shanghai Co Ltd, Shanghai, Peoples R China
[2] Nanjing Univ, Joint Lab Nanjing Univ & Transwarp Data Technol, Nanjing, Peoples R China
Keywords
AutoML; Automatic feature engineering; Relational dataset; Big data; Feature selection; Machine learning; FEATURE-SELECTION; CLASSIFICATION
DOI
10.1007/978-3-030-60239-0_3
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Automatic feature engineering aims to construct informative features automatically and to reduce the manual labor involved in machine learning applications. Most existing approaches are designed for tasks with a single data source, which makes them less applicable to real-world scenarios. In this paper, we present DAFEE, a distributed automatic feature engineering algorithm that generates features across multiple large-scale relational datasets. Starting from the target table, the algorithm uses a breadth-first-search-style traversal to find related tables and constructs advanced high-order features that are remarkably effective in practical applications. DAFEE also implements a feature selection method to reduce computational cost and improve predictive performance, and it is highly optimized to process massive volumes of data. Experimental results demonstrate that it can significantly improve predictive performance, by 7% compared to state-of-the-art (SOTA) algorithms.
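The table-discovery step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `related_tables_bfs`, the `max_depth` cutoff, and the example schema are all hypothetical, and the sketch only shows a generic breadth-first traversal over a table-relationship graph, starting from a target table.

```python
from collections import deque


def related_tables_bfs(target, relations, max_depth=2):
    """Return (table, depth) pairs reachable from `target`, in BFS order.

    `relations` maps a table name to the tables it links to (for example
    via foreign keys). `max_depth` is an assumed cutoff, not a parameter
    taken from the paper.
    """
    seen = {target}
    order = []
    queue = deque([(target, 0)])
    while queue:
        table, depth = queue.popleft()
        order.append((table, depth))
        if depth == max_depth:
            continue  # do not expand beyond the cutoff
        for neighbor in relations.get(table, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return order


# Hypothetical schema, for illustration only.
relations = {
    "customers": ["orders", "tickets"],
    "orders": ["order_items"],
    "order_items": ["products"],
}
result = related_tables_bfs("customers", relations, max_depth=2)
print(result)
```

In a pipeline of this kind, the depth at which a table is found could determine the order of the features built from it, though how DAFEE actually does this is not stated in the abstract.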
Pages: 32-46
Number of pages: 15