Fast Scalable Selection Algorithms for Large Scale Data

Cited by: 0

Authors:

Thompson, Lee Parnell [1]

Xu, Weijia [2]

Miranker, Daniel P. [1]

Affiliations:

[1] Univ Texas Austin, Dept Comp Sci, Austin, TX 78712 USA

[2] Univ Texas Austin, Texas Adv Comp Ctr, Austin, TX 78712 USA

Source:

2013 IEEE INTERNATIONAL CONFERENCE ON BIG DATA | 2013

Funding:

National Institutes of Health (USA);

Keywords:

Hadoop; Map Reduce; Selection Algorithms; Median Finding; Parallel Selection;

DOI:

Not available

Chinese Library Classification:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Selection finding, and its most common form, median finding, is used as a measure of central tendency for problems in biology, databases, and graphics. These problems often use selection finding as a subcomponent that may be called many times, so speed is important. The Map/Reduce framework has been shown to be an important tool for creating scalable applications. There are a number of valid implementations of selection algorithms within a Map/Reduce framework, several of which are compared in this paper. However, as the volume of data increases, subtle differences in algorithm design and implementation can lead to significant differences in practical performance. Therefore, an efficient and scalable selection finding method has the potential to benefit a number of applications. This paper compares algorithms that have been redesigned or created for the Map/Reduce framework for the purpose of selection finding, that is, finding the k-th ranked element in an unordered set. It takes concepts from two existing selection algorithms and translates them into a novel method for the Map/Reduce framework, with two variations. Each approach uses a different methodology to reduce the total workload needed for a selection. All the algorithms are compared for scalability and efficiency in a computing cluster environment with up to 256 processing cores. The results show that the methods proposed in this paper outperform several common alternatives for identifying medians with Hadoop, including sorting, Pig, and BinMedian methods. Our implementations are available upon request.
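To make the selection problem in the abstract concrete, the following is a minimal sketch of finding the k-th ranked element with a map step and a reduce step. It is not the method proposed in the paper, nor a faithful BinMedian implementation: it is a generic bin-counting illustration in plain Python, with lists standing in for Hadoop input splits. The names `bin_index`, `map_bin_counts`, `reduce_counts`, and `select_kth`, and the parameters `bins` and `small`, are assumptions introduced here for illustration only.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def bin_index(x: float, lo: float, width: float, bins: int) -> int:
    """Index of the equal-width bin that x falls into (max value goes in the last bin)."""
    return min(int((x - lo) / width), bins - 1)


def map_bin_counts(partition: Sequence[float], lo: float, width: float, bins: int) -> Counter:
    """Map step (hypothetical): one partition emits a count per bin."""
    return Counter(bin_index(x, lo, width, bins) for x in partition)


def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Reduce step (hypothetical): sum the per-partition bin counts."""
    total: Counter = Counter()
    for counts in partials:
        total.update(counts)
    return total


def select_kth(partitions: Sequence[Sequence[float]], k: int,
               bins: int = 16, small: int = 32) -> float:
    """Return the k-th smallest value (0-based) across all partitions."""
    if bins < 2:
        raise ValueError("bins must be at least 2")
    parts: List[List[float]] = [list(p) for p in partitions]
    if not 0 <= k < sum(len(p) for p in parts):
        raise IndexError("k is out of range")
    while True:
        remaining = sum(len(p) for p in parts)
        lo = min(min(p) for p in parts if p)
        hi = max(max(p) for p in parts if p)
        if lo == hi:
            return lo  # every remaining candidate is the same value
        if remaining <= small:
            # Few candidates left: finish with a local sort.
            return sorted(x for p in parts for x in p)[k]
        width = (hi - lo) / bins
        total = reduce_counts(map_bin_counts(p, lo, width, bins) for p in parts)
        # Walk the bins in order until the one containing rank k is found.
        seen = 0
        for b in range(bins):
            if seen + total[b] > k:
                break
            seen += total[b]
        k -= seen  # rank of the target inside the chosen bin
        # Keep only the candidates in that bin and repeat on the smaller set.
        parts = [[x for x in p if bin_index(x, lo, width, bins) == b] for p in parts]
```

Under these assumptions, the lower median of `n` values spread over several partitions would be `select_kth(partitions, (n - 1) // 2)`. Each round discards every bin except the one holding the target rank, which is one simple way to reduce the workload relative to fully sorting the data.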

Pages: 9
