Efficient approximate top-k mutual information based feature selection

Cited by: 1
Authors
Salam, Md Abdus [1]
Roy, Senjuti Basu [2]
Das, Gautam [1]
Affiliations
[1] Univ Texas Arlington, Dept Comp Sci & Engn, Arlington, TX 76019 USA
[2] New Jersey Inst Technol, Dept Comp Sci, Newark, NJ 07102 USA
Funding
National Science Foundation (USA)
Keywords
Feature selection; Top-k attribute selection; Mutual information; Data science; Functional dependencies; Inference
DOI
10.1007/s10844-022-00750-4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Feature selection is an important step in the data science pipeline, and it is critical to develop efficient algorithms for this step. Mutual Information (MI) is one of the important measures used for feature selection: attributes are sorted in descending order of MI score, and the top-k attributes are retained. The goal of this work is to develop a new measure, Attribute Average Conflict, to effectively approximate the top-k attributes without actually calculating MI. Our proposed method uses the database concept of approximate functional dependency to quantify the MI rank of attributes, which to our knowledge has not been studied before. We demonstrate the effectiveness of our proposed measure with a Monte Carlo simulation. We also perform extensive experiments using high-dimensional synthetic and real datasets with millions of records. Our results show that our proposed method demonstrates perfect accuracy in selecting the top-k attributes, yet is significantly more efficient than state-of-the-art baselines, including exact methods for computing Mutual Information based feature selection as well as adaptive random-sampling-based approaches. We also investigate the upper and lower bounds of the proposed measure and show that tighter bounds can be derived by using the marginal frequencies of attributes in specific arrangements. The bounds on the proposed measure can be used to select the top-k attributes without a full scan of the dataset in a single pass. We perform an experimental evaluation on real datasets to show the accuracy and effectiveness of this approach.
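For orientation, the exact baseline that the abstract contrasts against (score every attribute by MI with the target, sort in descending order, keep the top k) can be sketched as below. This is only an illustration of that baseline for discrete attributes, not the paper's Attribute Average Conflict measure, which the abstract does not define. The function names `mutual_information` and `top_k_by_mi` and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Exact mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # joint counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def top_k_by_mi(columns, target, k):
    """Rank attributes by MI with the target and return the k best names."""
    scores = {name: mutual_information(vals, target)
              for name, vals in columns.items()}
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:k], scores


# Toy data (assumed for illustration only).
target = [0, 0, 1, 1, 0, 0, 1, 1]
columns = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],  # identical to the target
    "b": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of the target
    "c": [0, 0, 1, 1, 0, 0, 1, 0],  # agrees with the target in 7 of 8 rows
}

selected, scores = top_k_by_mi(columns, target, k=2)
print(selected)
```

Every attribute requires a full pass over all records to build its joint counts, which is the cost that an approximate ranking of the kind described in the abstract aims to avoid.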
Pages: 191-223
Number of pages: 33