Feature ranking based on information gain for large classification problems with MapReduce

Cited by: 13
Authors
Zdravevski, Eftim [1]
Lameski, Petre [1]
Kulakov, Andrea [1]
Jakimovski, Boro [1]
Filiposka, Sonja [1]
Trajanov, Dimitar [1]
Affiliations
[1] Ss Cyril & Methodius Univ, Fac Comp Sci & Engn, Skopje, Macedonia
Keywords
Hadoop; HBase; MapReduce; information gain; parallelization; feature ranking; SELECTION;
DOI
10.1109/Trustcom.2015.580
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
In classification problems, a large number of features can pose a significant challenge in many respects, particularly in the context of Big Data. To address this issue, we propose a distributed and parallel computation of information gain based on MapReduce. The proposed implementation on Hadoop can be used for ranking the features of large datasets and, furthermore, for feature selection. Data parallelism is achieved by uniformly distributing the data using HBase tables with proper row keys. Performance is evaluated by estimating the speed-up of multi-node clusters relative to a one-node cluster. The framework was deployed on an on-premises Hadoop cluster. The results show that parallelizing and distributing the computations on a cluster can achieve a significant speed-up. The main contribution of this paper is a demonstration of how the higher-level scripting language Pig Latin can be used for writing MapReduce jobs instead of directly writing separate map and reduce functions. Additionally, we propose the use of manually pre-split HBase tables instead of HDFS files for data fragmentation in order to set the degree of parallelism at a higher level.
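To make the idea of a MapReduce-style information gain computation concrete, here is a minimal single-process sketch in Python. It is not the authors' implementation: the abstract describes Pig Latin scripts over HBase tables on Hadoop, none of which is reproduced here. The function names (`map_phase`, `reduce_phase`, `information_gain`, `rank_features`) are hypothetical, and the sketch assumes that features are already discrete and that every row carries a value for every feature.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (in bits) of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def map_phase(rows):
    """Emit ((feature, value, label), 1) for every feature of every row.

    rows: iterable of (features, label), where features is a dict
    mapping feature name -> discrete value (an assumption of this sketch).
    """
    for features, label in rows:
        for name, value in features.items():
            yield (name, value, label), 1


def reduce_phase(pairs):
    """Sum the emitted ones per key, as a reducer would after the shuffle."""
    counts = Counter()
    for key, one in pairs:
        counts[key] += one
    return counts


def information_gain(counts):
    """Compute IG(feature) = H(class) - H(class | feature) from the counts."""
    per_feature = defaultdict(lambda: defaultdict(Counter))
    for (name, value, label), n in counts.items():
        per_feature[name][value][label] += n

    gains = {}
    for name, by_value in per_feature.items():
        # Class distribution over all rows, recovered from this feature's counts.
        class_totals = Counter()
        for label_counts in by_value.values():
            class_totals.update(label_counts)
        total = sum(class_totals.values())
        h_class = entropy(class_totals.values())
        # Conditional entropy: weighted entropy of the labels per feature value.
        h_cond = sum(
            (sum(lc.values()) / total) * entropy(lc.values())
            for lc in by_value.values()
        )
        gains[name] = h_class - h_cond
    return gains


def rank_features(rows):
    """Return (feature, gain) pairs sorted by decreasing information gain."""
    gains = information_gain(reduce_phase(map_phase(rows)))
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
```

In a real cluster the map and reduce steps would run on separate data partitions; here they are ordinary functions chained together, which is enough to show which counts have to be aggregated before the entropies can be computed.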
Pages: 186-191
Number of pages: 6