Efficient approximate top-k mutual information based feature selection

Cited by: 1
Authors
Salam, Md Abdus [1]
Roy, Senjuti Basu [2]
Das, Gautam [1]
Affiliations
[1] Univ Texas Arlington, Dept Comp Sci & Engn, Arlington, TX 76019 USA
[2] New Jersey Inst Technol, Dept Comp Sci, Newark, NJ 07102 USA
Funding
National Science Foundation (USA)
Keywords
Feature selection; Top-k attribute selection; Mutual information; Data science; Functional dependencies; Inference
DOI
10.1007/s10844-022-00750-4
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Feature selection is an important step in the data science pipeline, and it is critical to develop efficient algorithms for this step. Mutual Information (MI) is one of the important measures used for feature selection: attributes are sorted in descending order of MI score, and the top-k attributes are retained. The goal of this work is to develop a new measure, Attribute Average Conflict, that effectively approximates the top-k attributes without actually calculating MI. The proposed method uses the database concept of approximate functional dependency to quantify the MI rank of attributes, which to our knowledge has not been studied before. We demonstrate the effectiveness of the proposed measure with a Monte Carlo simulation. We also perform extensive experiments using high-dimensional synthetic and real datasets with millions of records. Our results show that the proposed method achieves perfect accuracy in selecting the top-k attributes, yet is significantly more efficient than state-of-the-art baselines, including exact methods for computing Mutual Information based feature selection as well as adaptive random-sampling based approaches. We also investigate the upper and lower bounds of the proposed measure and show that tighter bounds can be derived by using the marginal frequencies of attributes in specific arrangements. The bounds on the proposed measure can be used to select the top-k attributes without a full scan of the dataset in a single pass. We perform an experimental evaluation on real datasets to show the accuracy and effectiveness of this approach.
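As a point of reference, the exact baseline the abstract describes (score every attribute by its MI with the target, sort in descending order, keep the top k) can be sketched as follows. This is a minimal illustration for discrete attributes only; it is not the paper's Attribute Average Conflict measure, which the abstract does not define. The names `mutual_information`, `top_k_by_mi`, `columns`, and `target` are hypothetical, and the toy data is made up for the example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)            # marginal counts of the attribute
    count_y = Counter(ys)            # marginal counts of the target
    count_xy = Counter(zip(xs, ys))  # joint counts
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def top_k_by_mi(columns, target, k):
    """Rank attributes by descending MI with the target and keep the first k."""
    scores = {name: mutual_information(values, target)
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (assumed for illustration): "a" copies the target,
# "b" agrees with it on 6 of 8 rows, "c" is independent of it.
target = [0, 0, 1, 1, 0, 1, 0, 1]
columns = {
    "a": [0, 0, 1, 1, 0, 1, 0, 1],
    "b": [0, 0, 1, 1, 0, 1, 1, 0],
    "c": [0, 1, 0, 1, 0, 1, 1, 0],
}

print(top_k_by_mi(columns, target, 2))  # ['a', 'b']
```

The sketch scans every row once per attribute, which is the cost the abstract's approximate approach aims to avoid.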
Pages: 191 - 223
Page count: 33
Related papers
50 in total
  • [1] Efficient approximate top-k mutual information based feature selection
    Md Abdus Salam
    Senjuti Basu Roy
    Gautam Das
    [J]. Journal of Intelligent Information Systems, 2023, 61 : 191 - 223
  • [2] Efficient Approximate Solutions to Mutual Information Based Global Feature Selection
    Venkateswara, Hemanth
    Lade, Prasanth
    Lin, Binbin
    Ye, Jieping
    Panchanathan, Sethuraman
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2015: 1009 - 1014
  • [3] Efficient Top-K Feature Selection Using Coordinate Descent Method
    Xu, Lei
    Wang, Rong
    Nie, Feiping
    Li, Xuelong
    [J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 9, 2023: 10594 - 10601
  • [4] Efficient Top-k Approximate Subtree Matching in Small Memory
    Augsten, Nikolaus
    Barbosa, Denilson
    Boehlen, Michael M.
    Palpanas, Themis
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2011, 23 (08) : 1123 - 1137
  • [5] Efficient Compressed Indexing for Approximate Top-k String Retrieval
    Ferrada, Hector
    Navarro, Gonzalo
    [J]. STRING PROCESSING AND INFORMATION RETRIEVAL, SPIRE 2014, 2014, 8799 : 18 - 30
  • [6] APPROXIMATE CONSISTENT WEIGHTED SAMPLING FOR EFFICIENT TOP-K SEARCH
    Kim, Yunna
    Hwang, Heasoo
    [J]. INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2020, 16 (03) : 1125 - 1132
  • [7] Communication Efficient Algorithms for Top-k Selection Problems
    Huebschle-Schneider, Lorenz
    Sanders, Peter
    [J]. 2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2016), 2016: 659 - 668
  • [8] A Feature Selection Algorithm Based on Approximate Markov Blanket and Dynamic Mutual Information
    Wang, Xiaodan
    Yao, Xu
    Zhang, Yuxi
    Lei, Lei
    [J]. INTELLIGENT SCIENCE AND INTELLIGENT DATA ENGINEERING, ISCIDE 2011, 2012, 7202 : 226 - 233
  • [9] Approximate distributed top-k queries
    Boaz Patt-Shamir
    Allon Shafrir
    [J]. Distributed Computing, 2008, 21 : 1 - 22
  • [10] An efficient top-k ranking method for service selection based on ε-ADMOPSO algorithm
    Yu, Wei
    Li, Shijun
    Tang, Xiaoyue
    Wang, Kai
    [J]. NEURAL COMPUTING & APPLICATIONS, 2019, 31 (Suppl 1) : 77 - 92