DAFEE: A Scalable Distributed Automatic Feature Engineering Algorithm for Relational Datasets

Authors
Zhao, Wenqian [1]
Li, Xiangxiang [1]
Rong, Guoping [2]
Lin, Mufeng [1]
Lin, Chen [1]
Yang, Yifan [1]
Affiliations
[1] Transwarp Technol Shanghai Co Ltd, Shanghai, Peoples R China
[2] Nanjing Univ, Joint Lab Nanjing Univ & Transwarp Data Technol, Nanjing, Peoples R China
Keywords
AutoML; Automatic feature engineering; Relational dataset; Big data; Feature selection; Machine learning; Classification
DOI
10.1007/978-3-030-60239-0_3
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Automatic feature engineering aims to construct informative features automatically and to reduce manual labor in machine learning applications. Most existing approaches are designed to handle tasks with only one data source, which makes them less applicable to real scenarios. In this paper, we present a distributed automatic feature engineering algorithm, DAFEE, that generates features across multiple large-scale relational datasets. Starting from the target table, the algorithm uses a breadth-first-search-style traversal to find related tables and constructs advanced high-order features that are remarkably effective in practical applications. Moreover, DAFEE implements a feature selection method to reduce computational cost and improve predictive performance. It is also highly optimized to process massive volumes of data. Experimental results demonstrate that it can significantly improve predictive performance, by 7% compared with state-of-the-art (SOTA) algorithms.
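The abstract does not spell out how the traversal works, so the following is only an illustrative sketch of a breadth-first search over table relationships, not the DAFEE implementation. The `relations` mapping, the table names, the `related_tables_bfs` function, and the `max_depth` cutoff are all hypothetical and chosen for the example; a real system would derive the relationships from foreign keys.

```python
from collections import deque


def related_tables_bfs(relations, target, max_depth=2):
    """Return (table, depth) pairs reachable from `target`, in BFS order.

    `relations` is a hypothetical adjacency mapping: table name -> list of
    directly related table names. Tables deeper than `max_depth` are skipped.
    """
    order = [(target, 0)]
    seen = {target}
    queue = deque([(target, 0)])
    while queue:
        table, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in relations.get(table, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append((neighbor, depth + 1))
                queue.append((neighbor, depth + 1))
    return order


# Hypothetical schema: "customers" plays the role of the target table.
relations = {
    "customers": ["orders", "addresses"],
    "orders": ["customers", "order_items"],
    "order_items": ["orders", "products"],
    "addresses": ["customers"],
    "products": ["order_items"],
}

result = related_tables_bfs(relations, "customers", max_depth=2)
for table, depth in result:
    print(depth, table)
```

In this sketch the depth of a table is the number of joins separating it from the target table, which is one plausible way to think about the order of features built from it; the paper itself should be consulted for how DAFEE defines high-order features.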
Pages: 32-46
Page count: 15