Distributed Column Subset Selection on MapReduce

Cited by: 17
Authors
Farahat, Ahmed K. [1 ]
Elgohary, Ahmed [1 ]
Ghodsi, Ali [1 ]
Kamel, Mohamed S. [1 ]
Affiliations
[1] Univ Waterloo, Waterloo, ON N2L 3G1, Canada
Keywords
Column Subset Selection; Greedy Algorithms; Distributed Computing; Big Data; MapReduce; JOHNSON;
DOI
10.1109/ICDM.2013.155
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Given a very large data set distributed over a cluster of several nodes, this paper addresses the problem of selecting a few data instances that best represent the entire data set. Solving this problem is of crucial importance in the big data era, as it enables data analysts to gain insight into the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. The paper first formulates the problem as the selection of a few representative columns from a matrix whose columns are massively distributed, and then proposes a MapReduce algorithm for selecting those representatives. The algorithm first learns a concise representation of all columns using random projection. It then solves a generalized column subset selection problem at each machine, in which a subset of columns is selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper then demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.
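The two stages described in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the abstract does not give the selection criterion or say how per-machine results are combined, so the greedy residual-based rule and the final merge step below are assumptions, and all function names (`random_projection`, `greedy_select`, `distributed_css`) are hypothetical. Each "machine" is simulated as a list of columns.

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def random_projection(columns, l, rng):
    """Return l random Gaussian combinations of the given columns."""
    m = len(columns[0])
    sketch = [[0.0] * m for _ in range(l)]
    for col in columns:
        for b in sketch:
            w = rng.gauss(0.0, 1.0)
            for i, x in enumerate(col):
                b[i] += w * x
    return sketch

def greedy_select(candidates, targets, k):
    """Greedily pick up to k candidate indices whose span best
    reconstructs the target vectors (assumed criterion)."""
    cand = [list(c) for c in candidates]   # candidate residuals
    resid = [list(t) for t in targets]     # target residuals
    chosen = []
    for _ in range(k):
        best, best_gain = None, 1e-12
        for j, e in enumerate(cand):
            norm2 = dot(e, e)
            if j in chosen or norm2 < 1e-12:
                continue
            # Drop in squared reconstruction error if column j is added.
            gain = sum(dot(e, r) ** 2 for r in resid) / norm2
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            break
        q_norm = dot(cand[best], cand[best]) ** 0.5
        q = [x / q_norm for x in cand[best]]
        # Remove the chosen direction from targets and candidates.
        for vecs in (resid, cand):
            for v in vecs:
                c = dot(q, v)
                for i in range(len(v)):
                    v[i] -= c * q[i]
        chosen.append(best)
    return chosen

def distributed_css(partitions, k, l, seed=0):
    """Return up to k (partition_id, column_index) pairs."""
    rng = random.Random(seed)
    m = len(partitions[0][0])
    # "Map": each partition sketches its own columns.
    # "Reduce": the partial sketches are summed into one concise representation.
    sketch = [[0.0] * m for _ in range(l)]
    for part in partitions:
        for b, p in zip(sketch, random_projection(part, l, rng)):
            for i in range(m):
                b[i] += p[i]
    # Each partition selects local columns that best reconstruct the sketch.
    pooled = [(pid, j)
              for pid, part in enumerate(partitions)
              for j in greedy_select(part, sketch, k)]
    # Assumed merge step: select k columns from the pooled local picks.
    final = greedy_select([partitions[p][j] for p, j in pooled], sketch, k)
    return [pooled[i] for i in final]
```

For example, `distributed_css([[[1, 0, 0], [2, 0, 0], [0, 3, 0]], [[0, 1, 0], [4, 0, 0], [0, 0, 0]]], k=2, l=4)` returns two `(partition, column)` pairs, one for each of the two directions spanned by the data. In a real MapReduce job the sketch would be broadcast to the workers rather than shared in memory.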
Pages: 171-180
Number of pages: 10