Revealing top-k dominant individuals in incomplete data based on spark environment

Cited by: 0

Authors:

Wang, Ke [1]

Cui, Binge [1]

Lin, Jerry Chun-Wei [2]

Wu, Jimmy Ming-Tai [1]

Affiliations:

[1] Shandong Univ Sci & Technol, Coll Comp Sci & Engn, Qingdao, Peoples R China

[2] Western Norway Univ Appl Sci, Dept Comp Sci Elect Engn & Math Sci, Bergen, Norway

Source:

ENVIRONMENT DEVELOPMENT AND SUSTAINABILITY | 2022

Keywords:

Incomplete dataset; Top-k dominance query; MapReduce; Spark; QUERIES;

DOI:

10.1007/s10668-022-02652-5

Chinese Library Classification (CLC) number:

X [Environmental Science, Safety Science];

Discipline classification codes:

08 ; 0830 ;

Abstract:

Incomplete datasets are a new type of dataset that arises for various reasons. For example, some data can be lost during transmission because of abnormal signal interruptions, and gene expression profile data can end up incomplete because of dust on gene chips, among other causes. A top-k dominance (TKD) query returns the k objects with the largest dominance scores in a given dataset. For large-scale incomplete datasets with missing data in unknown dimensions, most existing research is based on the Hadoop MapReduce framework, but the resulting algorithms perform poorly because Hadoop MapReduce is ill-suited to multi-task iterative computing and has a long start-up time. The Spark framework is a more efficient data processing framework, with a rich computational model and in-memory data processing. Based on this analysis, this paper proposes a query algorithm (Spark_TKD) built on the Spark framework. It uses a simple method for calculating the number of objects that each object dominates, which greatly reduces the computational complexity, the data exchanged between cluster nodes, and the number of disk I/O operations. Finally, comparison experiments on real and synthetic datasets show that the proposed algorithm performs better in terms of time consumption and disk footprint.
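To make the notion of a dominance score concrete, the following is a minimal brute-force sketch in plain Python. It is not the Spark_TKD algorithm, whose details the abstract does not give. It assumes the dominance definition commonly used for incomplete data (two objects are compared only on the dimensions observed in both), assumes that smaller values are better, and uses hypothetical names (`dominates`, `top_k_dominating`, `EXAMPLE_ROWS`) and made-up toy data.

```python
from typing import List, Optional, Sequence, Tuple

# One object; None marks a missing value (assumed representation).
Row = Sequence[Optional[float]]


def dominates(a: Row, b: Row) -> bool:
    """Return True if a dominates b on the dimensions observed in both.

    Assumption: smaller is better. a must be no worse on every shared
    dimension and strictly better on at least one.
    """
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not common:
        return False  # nothing to compare on
    return all(x <= y for x, y in common) and any(x < y for x, y in common)


def top_k_dominating(rows: Sequence[Row], k: int) -> List[Tuple[int, int]]:
    """Return (row index, dominance score) for the k highest-scoring rows.

    The score of a row is the number of other rows it dominates.
    This is an O(n^2) pairwise scan, for illustration only.
    """
    scores = [
        sum(dominates(a, b) for j, b in enumerate(rows) if j != i)
        for i, a in enumerate(rows)
    ]
    # Sort by descending score; break ties by row index.
    ranked = sorted(range(len(rows)), key=lambda i: (-scores[i], i))
    return [(i, scores[i]) for i in ranked[:k]]


# Toy data, invented for this example.
EXAMPLE_ROWS = [
    (1, None, 2),
    (2, 3, None),
    (None, 1, 3),
    (3, 4, 4),
]

if __name__ == "__main__":
    print(top_k_dominating(EXAMPLE_ROWS, 2))
```

The pairwise scan compares every object with every other object, which is the kind of cost that a distributed implementation would need to reduce on large datasets.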


Pages: 21

