Stream-based live entity resolution approach with adaptive duplicate count strategy

Cited by: 6
Authors
Ma, Kun [1]
Yang, Bo [1]
Affiliations
[1] Univ Jinan, Shandong Prov Key Lab Network Based Intelligent C, Jinan 250022, Shandong, Peoples R China
Funding
Austrian Science Fund;
Keywords
big data; cloud computing; entity resolution; MapReduce; NoSQL; sorted neighbourhood; stream processing; RECORD; CACHE;
DOI
10.1504/IJWGS.2017.10006055
Chinese Library Classification
TP [Automation technology, computer technology];
Subject classification code
0812 ;
Abstract
Recently, researchers have become increasingly concerned with the large-scale news and tweet data generated by social media. Some cloud service providers use this data to gauge public sentiment for their tenants. The challenge is how to clean this big data in the cloud before further analysis. To address this issue, we propose a new live entity resolution approach that finds duplicates in news and tweet data. We investigate possible solutions for live entity resolution in the cloud: making the sliding window size adaptive using multistep distance and a window-size-dependent duplicate count strategy with an alterable window step, and finding duplicates by overlapping boundary objects in adjacent blocks. Finally, our experimental evaluation on large news datasets shows the high effectiveness and efficiency of the proposed approaches.
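To illustrate the general idea of a sorted neighbourhood pass whose window grows with the number of duplicates found, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the multistep distance or the exact growth rule, so the function names, the `phi` ratio threshold, the `step` increment, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Hypothetical similarity test: character-level ratio from difflib."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def adaptive_sorted_neighbourhood(records, window=2, step=1, phi=0.5,
                                  threshold=0.8):
    """Sorted neighbourhood with a duplicate-count-driven window (sketch).

    Records are sorted, then each record is compared with the records that
    follow it inside a window. Whenever the end of the window is reached and
    the share of duplicates among the comparisons made so far is at least
    `phi`, the window is extended by `step` records (assumed growth rule).
    """
    ordered = sorted(records)
    n = len(ordered)
    pairs = set()
    for i in range(n):
        end = min(i + window, n)  # exclusive upper bound of the window
        j = i + 1
        comparisons = 0
        duplicates = 0
        while j < end:
            comparisons += 1
            if similar(ordered[i], ordered[j], threshold):
                duplicates += 1
                pairs.add((ordered[i], ordered[j]))
            j += 1
            # At the window boundary, grow the window if duplicates are dense.
            if j == end and duplicates / comparisons >= phi:
                end = min(end + step, n)
    return pairs


if __name__ == "__main__":
    headlines = [
        "cloud data cleaning",
        "cloud data cleaning!",
        "cloud data cleanin",
        "zebra",
    ]
    for pair in sorted(adaptive_sorted_neighbourhood(headlines)):
        print(pair)
```

With a fixed window of 2, only adjacent records would be compared; the growth rule lets the first record also reach the third one when its immediate neighbour is already a duplicate.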
Pages: 351-373
Page count: 23