Fast Scalable Selection Algorithms for Large Scale Data

Cited by: 0
Authors
Thompson, Lee Parnell [1]
Xu, Weijia [2]
Miranker, Daniel P. [1]
Affiliations
[1] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA
[2] Univ Texas Austin, Texas Adv Comp Ctr, Austin, TX 78712 USA
Funding
National Institutes of Health (USA)
Keywords
Hadoop; Map Reduce; Selection Algorithms; Median Finding; Parallel Selection
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Selection finding, most commonly in the form of median finding, is used as a measure of central tendency in problems in biology, databases, and graphics. These problems often use selection finding as a subcomponent that may be called many times, so speed is important. The Map/Reduce framework has been shown to be an important tool for creating scalable applications. There are a number of valid implementations of selection algorithms within a Map/Reduce framework, several of which are compared in this paper. However, as the volume of data increases, subtle differences in algorithm design and implementation can lead to significant differences in practice. Therefore, an efficient and scalable selection finding method has the potential to provide general benefit to a number of applications. This paper compares algorithms that have been redesigned or created for the Map/Reduce framework for the purpose of selection finding, that is, finding the k-th ranked element in an unordered set. It takes concepts from two existing selection algorithms and translates them into a novel Map/Reduce method with two variations. Each approach uses a different strategy to reduce the total amount of work needed for a selection. All of the algorithms are compared for scalability and efficiency in a computing cluster environment with up to 256 processing cores. The results show that the methods proposed in this paper outperform several common alternatives for identifying medians with Hadoop, including sorting, Pig, and BinMedian methods. Our implementations are also available upon request.
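As an illustration of the selection problem described in the abstract (finding the k-th ranked element in an unordered set without a full sort), the following is a minimal single-process sketch of a pivot-counting selection written in a map/reduce style. It is not the method proposed in the paper, which the abstract does not detail; the function names `map_count` and `select_kth`, the chunked-list stand-in for distributed data, and the random pivot rule are all assumptions made for illustration.

```python
import random
from typing import List, Tuple


def map_count(chunk: List[int], pivot: int) -> Tuple[int, int]:
    """Map step (hypothetical): count elements below and equal to the pivot."""
    below = sum(1 for x in chunk if x < pivot)
    equal = sum(1 for x in chunk if x == pivot)
    return below, equal


def select_kth(chunks: List[List[int]], k: int, seed: int = 0) -> int:
    """Return the k-th smallest element (0-based) across all chunks.

    Each round picks a pivot, counts per chunk (map), sums the counts
    (reduce), and discards the side that cannot contain rank k.
    """
    rng = random.Random(seed)
    total = sum(len(c) for c in chunks)
    if not 0 <= k < total:
        raise IndexError("k out of range")
    while True:
        # Assumption: the pivot is drawn from a random non-empty chunk.
        non_empty = [c for c in chunks if c]
        pivot = rng.choice(rng.choice(non_empty))
        counts = [map_count(c, pivot) for c in chunks]  # map
        below = sum(b for b, _ in counts)               # reduce
        equal = sum(e for _, e in counts)
        if k < below:
            # Rank k lies strictly below the pivot.
            chunks = [[x for x in c if x < pivot] for c in chunks]
        elif k < below + equal:
            return pivot
        else:
            # Rank k lies strictly above the pivot; shift k accordingly.
            chunks = [[x for x in c if x > pivot] for c in chunks]
            k -= below + equal


if __name__ == "__main__":
    parts = [[7, 1, 5], [3, 9], [2, 8, 6, 4]]
    print(select_kth(parts, 4))  # median of 1..9, prints 5
```

Each round discards at least the pivot itself, so the loop terminates. In a real Map/Reduce job the per-chunk counting would run on separate workers and only the small count pairs would be shuffled to the reducer.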
Pages: 9