Incremental processing for string similarity join

Cited by: 1
Authors
Yan, Cairong [1]
Zhu, Bin [1]
Gan, Yanglan [1]
Xu, Guangwei [1]
Affiliation
[1] Donghua University, School of Computer Science and Technology, Shanghai, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
string similarity join; incremental processing; parallel processing; string matching; Spark; computational science; engineering
DOI
10.1504/IJCSE.2019.103780
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
String similarity join is an essential operation in data quality management and a key step in extracting value from data. This paper proposes an incremental processing framework for string similarity join. Compared with the batch processing model, it avoids the heavy time and space costs of repeatedly computing similarities among historical strings, which makes it suitable for processing data streams. We implement two algorithms, Inc-Join and Inp-Join: Inc-Join runs on a stand-alone machine, while Inp-Join runs on a Spark cluster. Experimental results show that the incremental framework reduces the number of string matching operations without affecting join accuracy. When the data volume becomes large, Inp-Join takes full advantage of parallel processing and outperforms Inc-Join.
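
The incremental idea in the abstract can be illustrated with a minimal single-machine sketch. This is not the paper's Inc-Join or Inp-Join: the class name `IncrementalJoin`, the whitespace tokenization, the Jaccard measure, and the threshold are all assumptions chosen for illustration, and no filtering or indexing is attempted. It only shows that each arriving string is compared with the stored strings, while pairs of historical strings are never compared again.

```python
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class IncrementalJoin:
    """Hypothetical sketch: stores past strings and joins only new arrivals."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.history: List[Tuple[str, Set[str]]] = []
        self.comparisons = 0  # similarity computations performed so far

    def add_batch(self, batch: List[str]) -> List[Tuple[str, str]]:
        """Return similar pairs that involve at least one string from `batch`."""
        pairs = []
        for s in batch:
            tokens = set(s.lower().split())
            # Compare the new string with every stored string (historical
            # strings plus earlier strings of this batch), never old vs. old.
            for t, t_tokens in self.history:
                self.comparisons += 1
                if jaccard(tokens, t_tokens) >= self.threshold:
                    pairs.append((t, s))
            self.history.append((s, tokens))
        return pairs


join = IncrementalJoin(threshold=0.5)
first = join.add_batch(
    ["string similarity join", "similarity join string", "apache spark"]
)
second = join.add_batch(["spark apache cluster"])
```

In this sketch the second call performs only three comparisons (the new string against the three stored ones), whereas rerunning a naive batch join over all four strings would recompute the three historical pairs as well.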
Pages: 255-268
Page count: 14