MR-SimLab: Scalable subgraph selection with label similarity for big data

Cited by: 19
Authors
Dhifli, Wajdi [1]
Aridhi, Sabeur [2]
Nguifo, Engelbert Mephu [3]
Affiliations
[1] Univ Evry Val Essonne, Inst Syst & Synthet Biol, F-91030 Evry, France
[2] Univ Lorraine, LORIA, F-54506 Vandoeuvre Les Nancy, France
[3] Univ Clermont Auvergne, CNRS, LIMOS, F-63000 Clermont Ferrand, France
Keywords
Feature selection; Subgraph mining; Label similarity; MapReduce; Search
DOI
10.1016/j.is.2017.05.006
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the increasing size and complexity of available databases, existing machine learning and data mining algorithms are facing a scalability challenge. In many applications, the number of features describing the data can be extremely high. This hinders further exploration and can even make it infeasible. In fact, many of these features are redundant or simply irrelevant. Hence, feature selection plays a key role in helping to overcome the problem of information overload, especially in big data applications. Since many complex datasets can be modeled by graphs of interconnected labeled elements, in this work we are particularly interested in feature selection for subgraph patterns. In this paper, we propose MR-SimLab, a MapReduce-based approach for subgraph selection from large input subgraph sets. In many applications, it is easy to compute pairwise similarities between the labels of the graph nodes. Our approach leverages this information to measure an approximate subgraph matching by aggregating the elementary label similarities between the matched nodes. Based on the aggregated similarity scores, our approach selects a small subset of informative representative subgraphs. We provide a distributed implementation of our algorithm on top of the MapReduce framework that optimizes the computational efficiency of our approach for big data applications. We experimentally evaluate MR-SimLab on real datasets. The obtained results show that our approach is scalable and that the selected subgraphs are informative. © 2017 Elsevier Ltd. All rights reserved.
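The selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the label similarity table, the position-wise node matching, the mean aggregation, the threshold, and all function names (`label_similarity`, `pattern_similarity`, `select_representatives`, `map_reduce_select`) are assumptions made for illustration, and the MapReduce stages are simulated in a single process.

```python
from itertools import chain

# Hypothetical pairwise label similarities in [0, 1]; the values are made up.
LABEL_SIM = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.1,
    frozenset(("B", "C")): 0.2,
}

def label_similarity(a, b):
    """Similarity of two node labels; identical labels score 1.0."""
    if a == b:
        return 1.0
    return LABEL_SIM.get(frozenset((a, b)), 0.0)

def pattern_similarity(p, q):
    """Mean label similarity over position-matched nodes.

    Assumption: patterns are tuples of node labels and node i of p is
    matched to node i of q, which sidesteps real subgraph matching.
    """
    if not p or len(p) != len(q):
        return 0.0
    return sum(label_similarity(a, b) for a, b in zip(p, q)) / len(p)

def select_representatives(patterns, threshold):
    """Greedily keep a pattern only if no kept pattern is similar to it."""
    selected = []
    for p in patterns:
        if all(pattern_similarity(p, r) < threshold for r in selected):
            selected.append(p)
    return selected

def map_reduce_select(patterns, n_partitions, threshold):
    """Simulated map step (per-partition selection) and reduce step (merge)."""
    partitions = [patterns[i::n_partitions] for i in range(n_partitions)]
    local = [select_representatives(part, threshold) for part in partitions]
    return select_representatives(list(chain.from_iterable(local)), threshold)

patterns = [("A", "A", "C"), ("A", "B", "C"), ("B", "B", "C"), ("C", "C", "A")]
representatives = map_reduce_select(patterns, n_partitions=2, threshold=0.8)
print(representatives)  # [('A', 'A', 'C'), ('C', 'C', 'A')]
```

In this toy input, the three patterns built from the similar labels A and B collapse onto one representative, while the dissimilar pattern is kept. A two-stage selection like this one is not guaranteed to return the same subset as a single pass over the whole input, since the result depends on how the patterns are partitioned and ordered.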
Pages: 155-163
Page count: 9