Stream-based live entity resolution approach with adaptive duplicate count strategy

Cited by: 6
Authors
Ma, Kun [1]
Yang, Bo [1]
Affiliations
[1] Univ Jinan, Shandong Prov Key Lab Network Based Intelligent C, Jinan 250022, Shandong, Peoples R China
Funding
Austrian Science Fund;
Keywords
big data; cloud computing; entity resolution; MapReduce; NoSQL; sorted neighbourhood; stream processing; RECORD; CACHE;
DOI
10.1504/IJWGS.2017.10006055
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Researchers have recently become increasingly concerned with the large-scale news and tweet data generated by social media. Some cloud service providers use this data to gauge public sentiment for their tenants. The challenge is how to clean big data in the cloud before further analysis. To address this issue, we propose a new live entity resolution approach that finds duplicates in streaming news and tweet data. We investigate possible solutions for live entity resolution in the cloud: making the sliding window size adaptive using multistep distance and a window-size-dependent duplicate count strategy with an alterable window step, and finding duplicates by overlapping boundary objects in adjacent blocks. Finally, our experimental evaluation on large news datasets shows the high effectiveness and efficiency of the proposed approaches.
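The abstract names a sorted-neighbourhood window whose size adapts to the number of duplicates found. The following is a minimal Python sketch of that general idea only, not the authors' algorithm: it omits the multistep distance, the cloud and stream processing, and the overlapping of boundary objects between blocks. The function names, the string-similarity matcher, and the parameters `window`, `step`, and `phi` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical matcher: case-insensitive string similarity >= threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def adaptive_window_duplicates(records, window=3, step=2, phi=0.5):
    """Sorted-neighbourhood pass with a duplicate-count-driven window.

    Records are sorted by a simple key (assumed here: the lower-cased text).
    Each record is compared with its successors inside the window. When the
    window edge is reached and the ratio of duplicates found to comparisons
    made is at least `phi`, the window grows by `step` records.
    Returns the sorted records and a set of index pairs into that list.
    """
    items = sorted(records, key=str.lower)
    n = len(items)
    pairs = set()
    for i in range(n):
        end = min(i + window, n)  # exclusive end of the current window
        found = 0                 # duplicates detected for record i
        compared = 0              # comparisons made for record i
        j = i + 1
        while j < end:
            compared += 1
            if similar(items[i], items[j]):
                found += 1
                pairs.add((i, j))
            j += 1
            # At the window edge, extend it if enough duplicates were seen.
            if j == end and end < n and found / compared >= phi:
                end = min(end + step, n)
    return items, pairs


if __name__ == "__main__":
    headlines = [
        "Jinan flood warning issued",
        "jinan flood warning issued",
        "Jinan flood warning issued!",
        "Stock markets rally",
        "Weather update for Shandong",
    ]
    ordered, dupes = adaptive_window_duplicates(headlines)
    for a, b in sorted(dupes):
        print(repr(ordered[a]), "~", repr(ordered[b]))
```

In this sketch a dense run of duplicates keeps enlarging the window, while sparse regions are scanned with the small initial window, which is the trade-off an adaptive duplicate count strategy aims for.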
Pages: 351-373
Number of pages: 23