Revealing top-k dominant individuals in incomplete data based on spark environment

Authors
Wang, Ke [1]
Cui, Binge [1]
Lin, Jerry Chun-Wei [2]
Wu, Jimmy Ming-Tai [1]
Affiliations
[1] Shandong Univ Sci & Technol, Coll Comp Sci & Engn, Qingdao, Peoples R China
[2] Western Norway Univ Appl Sci, Dept Comp Sci Elect Engn & Math Sci, Bergen, Norway
Keywords
Incomplete dataset; Top-k dominance query; MapReduce; Spark; QUERIES;
DOI
10.1007/s10668-022-02652-5
Chinese Library Classification (CLC) number
X [Environmental Science, Safety Science];
Discipline classification code
08; 0830;
Abstract
Incomplete datasets are a new type of dataset that arises for various reasons. For example, data can be lost during transmission because of abnormal signal interruptions, and dust on gene chips, among other causes, can leave acquired gene expression profile data incomplete. A top-k dominance (TKD) query returns the k objects with the largest dominance scores in a given dataset. For large-scale incomplete datasets whose values are missing in unknown dimensions, most existing research builds on the Hadoop MapReduce framework, but the resulting algorithms perform poorly because Hadoop MapReduce handles multi-task iterative computation badly and has a long start-up time. The Spark framework processes data more efficiently, offering a rich computational model and in-memory data processing. Based on this analysis, this paper proposes a Spark-based query algorithm (Spark_TKD) built around a simple method for calculating the number of objects each object dominates, which greatly reduces computational complexity, data exchange between cluster nodes, and disk I/O operations. Comparison experiments on real and synthetic datasets show that the proposed algorithm performs better in terms of time consumption and disk footprint.
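To make the TKD query concrete, the following is a minimal single-machine sketch, not the Spark_TKD algorithm itself, whose details the abstract does not give. It assumes the dominance definition commonly used for incomplete data: object a dominates object b if, on every dimension observed in both, a is no worse than b, and on at least one such dimension a is strictly better. It also assumes that smaller values are preferred and that a missing value is encoded as `None`. The names `dominates`, `top_k_dominating`, and `example_rows` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# One object; None marks a missing value (an assumed encoding).
Row = Sequence[Optional[float]]


def dominates(a: Row, b: Row) -> bool:
    """Return True if a dominates b on the dimensions observed in both rows.

    Assumption: smaller values are better.
    """
    strictly_better = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # dimension missing in at least one row: not comparable
        if x > y:
            return False  # a is worse on a shared dimension
        if x < y:
            strictly_better = True
    return strictly_better


def top_k_dominating(rows: Sequence[Row], k: int) -> List[Tuple[int, int]]:
    """Return (row index, dominance score) for the k highest-scoring rows.

    Brute-force O(n^2) pairwise comparison; ties are broken by row index.
    """
    scored = []
    for i, a in enumerate(rows):
        score = sum(1 for j, b in enumerate(rows) if j != i and dominates(a, b))
        scored.append((i, score))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Illustrative data only: four objects over three dimensions.
example_rows = [
    (1, None, 2),
    (2, 3, None),
    (None, 1, 3),
    (3, 4, 4),
]
result = top_k_dominating(example_rows, k=2)
```

The pairwise loop is quadratic in the number of objects, which is the cost that distributed approaches such as the one described in the abstract aim to reduce.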
Pages: 21
Related papers
50 in total
  • [1] Top-k Dominating Queries on Incomplete Data
    Miao, Xiaoye
    Gao, Yunjun
    Zheng, Baihua
    Chen, Gang
    Cui, Huiyong
    [J]. 2016 32ND IEEE INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2016: 1500-1501
  • [2] Top-k Dominating Queries on Incomplete Data
    Miao, Xiaoye
    Gao, Yunjun
    Zheng, Baihua
    Chen, Gang
    Cui, Huiyong
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2016, 28(01): 252-266
  • [3] Weighted top-k dominating queries on highly incomplete data
    Fattah, H. M. Abdul
    Hasan, K. M. Azharul
    Tsuji, Tatsuo
    [J]. INFORMATION SYSTEMS, 2022, 107
  • [4] Continuous Top-k Dominating Query of Incomplete Data over Data Streams
    Santoso, Bagus Jati
    Permadi, Vynska Amalia
    Ahmad, Tohari
    Ijtihadie, Royyana Muslim
    Sektiaji, Bayu
    [J]. PROCEEDINGS OF 2018 3RD INTERNATIONAL CONFERENCE ON SUSTAINABLE INFORMATION ENGINEERING AND TECHNOLOGY (SIET 2018), 2018: 21-26
  • [5] Mining Top-K Sequential Patterns in the Data Stream Environment
    Dai, Bi-Ru
    Jiang, Hung-Lin
    Chung, Chih-Heng
    [J]. INTERNATIONAL CONFERENCE ON TECHNOLOGIES AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE (TAAI 2010), 2010: 142-149
  • [6] A top-k spatial join querying processing algorithm based on spark
    Qiao, Baiyou
    Hu, Bing
    Zhu, Junhai
    Wu, Gang
    Giraud-Carrier, Christophe
    Wang, Guoren
    [J]. INFORMATION SYSTEMS, 2020, 87
  • [7] SRJA:A Research on Optimizing Top-k Join Queries Based on Spark
    Ren, Hui
    Fu, Haidong
    Xu, Fangfang
    Gu, Jinguang
    Zhao, Di
    [J]. PROCEEDINGS OF THE 2017 12TH IEEE CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS (ICIEA), 2017: 1000-1005
  • [8] Effective and efficient top-k query processing over incomplete data streams
    Ren, Weilong
    Lian, Xiang
    Ghazinour, Kambiz
    [J]. INFORMATION SCIENCES, 2021, 544: 343-371
  • [9] TopCrowd - Efficient Crowd-enabled Top-k Retrieval on Incomplete Data
    Nieke, Christian
    Guentzer, Ulrich
    Balke, Wolf-Tilo
    [J]. CONCEPTUAL MODELING, 2014, 8824: 122-135
  • [10] Top-k dominating queries on incomplete large dataset
    Wu, Jimmy Ming-Tai
    Wei, Min
    Wu, Mu-En
    Tayeb, Shahab
    [J]. The Journal of Supercomputing, 2022, 78: 3976-3997