Distributed Column Subset Selection on MapReduce

Cited by: 17
Authors
Farahat, Ahmed K. [1]
Elgohary, Ahmed [1]
Ghodsi, Ali [1]
Kamel, Mohamed S. [1]
Affiliations
[1] Univ Waterloo, Waterloo, ON N2L 3G1, Canada
Source
2013 IEEE 13TH INTERNATIONAL CONFERENCE ON DATA MINING (ICDM) | 2013
Keywords
Column Subset Selection; Greedy Algorithms; Distributed Computing; Big Data; MapReduce; JOHNSON;
DOI
10.1109/ICDM.2013.155
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Given a very large data set distributed over a cluster of several nodes, this paper addresses the problem of selecting a few data instances that best represent the entire data set. Solving this problem is of crucial importance in the big data era, as it enables data analysts to gain insight into the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. The paper first formulates the problem as the selection of a few representative columns from a matrix whose columns are massively distributed, and it then proposes a MapReduce algorithm for selecting those representatives. The algorithm first learns a concise representation of all columns using random projection. It then solves a generalized column subset selection problem at each machine, in which a subset of columns is selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper then demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.
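The two stages described in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation: the function names (`partial_projection`, `sum_partials`, `greedy_select`), the use of random ±1 entries for the projection matrix, and the particular greedy scoring rule are all assumptions made for illustration, and the map and reduce steps are only emulated with plain Python lists. Matrices are stored as lists of columns.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def partial_projection(columns, c, rng):
    """Emulated map step: this partition's contribution to B = A * Omega.
    Assumption: Omega has random +/-1 entries, drawn on the fly."""
    m = len(columns[0])
    B = [[0.0] * m for _ in range(c)]
    for col in columns:
        for k in range(c):
            w = rng.choice((-1.0, 1.0))  # one entry of Omega
            for i in range(m):
                B[k][i] += w * col[i]
    return B

def sum_partials(partials):
    """Emulated reduce step: add the per-partition contributions entrywise."""
    return [[sum(vals) for vals in zip(*cols)] for cols in zip(*partials)]

def greedy_select(local_columns, B, k, tol=1e-12):
    """Greedily pick up to k local columns whose span best reconstructs
    the columns of B (a generalized column subset selection sketch)."""
    R = [list(c) for c in local_columns]  # residuals of candidate columns
    E = [list(b) for b in B]              # residuals of the target columns
    selected = []
    for _ in range(k):
        best, best_score = None, -1.0
        for j, r in enumerate(R):
            norm2 = dot(r, r)
            if norm2 <= tol:              # already in the selected span
                continue
            # Reduction in squared reconstruction error if column j is added.
            score = sum(dot(e, r) ** 2 for e in E) / norm2
            if score > best_score:
                best, best_score = j, score
        if best is None:
            break
        nrm = dot(R[best], R[best]) ** 0.5
        q = [x / nrm for x in R[best]]
        for vectors in (R, E):            # deflate candidates and targets
            for v in vectors:
                coef = dot(q, v)
                for i in range(len(v)):
                    v[i] -= coef * q[i]
        selected.append(best)
    return selected

# Two hypothetical partitions of a 3 x 4 matrix, two columns each.
partitions = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]
rng = random.Random(0)
B = sum_partials([partial_projection(p, 4, rng) for p in partitions])
picks = [greedy_select(p, B, 1) for p in partitions]  # local column indices
```

Here `B` plays the role of the concise representation: it has only `c` columns regardless of how many columns the full matrix has, so it is cheap to send to every machine, and each machine can then score its own columns against it without seeing the rest of the data.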
Pages: 171-180
Number of pages: 10
Related papers
50 in total
  • [21] Column Subset Selection, Matrix Factorization, and Eigenvalue Optimization
    Tropp, Joel A.
    PROCEEDINGS OF THE TWENTIETH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2009, : 978 - 986
  • [22] Equity Factor Analysis via Column Subset Selection
    Boutsidis, Christos
    Malioutov, Dmitry
    2013 IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING (GLOBALSIP), 2013, : 1131 - 1131
  • [23] Interlacing Polynomial Method for the Column Subset Selection Problem
    Cai, Jian-Feng
    Xu, Zhiqiang
    Xu, Zili
    INTERNATIONAL MATHEMATICS RESEARCH NOTICES, 2024, 2024 (09) : 7798 - 7819
  • [24] An Improved Approximation Algorithm for the Column Subset Selection Problem
    Boutsidis, Christos
    Mahoney, Michael W.
    Drineas, Petros
    PROCEEDINGS OF THE TWENTIETH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2009, : 968 - +
  • [25] Column Subset Selection Problem is UG-hard
    Civril, A.
    JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2014, 80 (04) : 849 - 859
  • [26] An Optimal Subset Selection Algorithm for Distributed Hypothesis Test
    Li, Jiarui
    Guo, Guangbao
    IAENG International Journal of Applied Mathematics, 2024, 54 (12) : 2811 - 2815
  • [27] Communication-efficient estimation for distributed subset selection
    Yan Chen
    Ruipeng Dong
    Canhong Wen
    Statistics and Computing, 2023, 33
  • [28] Communication-efficient estimation for distributed subset selection
    Chen, Yan
    Dong, Ruipeng
    Wen, Canhong
    STATISTICS AND COMPUTING, 2023, 33 (06)
  • [29] The COR criterion for optimal subset selection in distributed estimation
    Guo, Guangbao
    Song, Haoyue
    Zhu, Lixing
    STATISTICS AND COMPUTING, 2024, 34 (05)
  • [30] An Empirical Comparison of Sampling Techniques for Matrix Column Subset Selection
    Wang, Yining
    Singh, Aarti
    2015 53RD ANNUAL ALLERTON CONFERENCE ON COMMUNICATION, CONTROL, AND COMPUTING (ALLERTON), 2015, : 1069 - 1074