Batch Text Similarity Search with MapReduce

Times cited: 0
Authors
Li, Rui [1, 2, 3]
Ju, Li [4]
Peng, Zhuo [1]
Yu, Zhiwei [5]
Wang, Chaokun [1, 2, 3]
Affiliations
[1] Tsinghua Univ, Sch Software, Beijing 100084, Peoples R China
[2] Tsinghua Natl Lab Informat Sci & Technol, Beijing, Peoples R China
[3] Ministry Educ, Key Lab Informat Syst Secur, Beijing, Peoples R China
[4] Henan Coll Finance & Taxat, Dept Informat Engn, Zhengzhou 450002, Peoples R China
[5] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
Source
Funding
National Natural Science Foundation of China
Keywords
MapReduce; Batch Text Similarity Search;
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Batch text similarity search finds texts similar to each text in a batch of user queries. It is widely used in practice, for example in plagiarism checking, and is attracting growing attention as the volume of text on the web increases. Existing approaches, such as Fuzzy Join, support neither varying similarity thresholds nor online batch text similarity search. This paper proposes a two-stage algorithm that effectively addresses batch text similarity search using inverted index structures. Experimental results on real datasets demonstrate the efficiency and scalability of the method.
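The abstract names only the ingredients (a two-stage algorithm over inverted index structures) and gives no further detail. The following is a minimal single-machine sketch of one common way such a search can be organized. It is not the authors' algorithm, and it does not use MapReduce. The whitespace tokenizer, the Jaccard measure, and the function names `build_inverted_index` and `batch_search` are all assumptions made for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenization into a set of words (an assumption)."""
    return set(text.lower().split())


def build_inverted_index(corpus):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def batch_search(queries, corpus, index, threshold):
    """Answer a batch of queries against a prebuilt inverted index.

    Stage 1 (hypothetical): collect candidates sharing at least one token.
    Stage 2 (hypothetical): verify each candidate against the threshold.
    """
    results = {}
    for query_id, query in queries.items():
        query_tokens = tokenize(query)
        # Stage 1: candidate generation from the posting lists.
        candidates = set()
        for token in query_tokens:
            candidates |= index.get(token, set())
        # Stage 2: exact verification of each candidate.
        matches = []
        for doc_id in sorted(candidates):
            score = jaccard(query_tokens, tokenize(corpus[doc_id]))
            if score >= threshold:
                matches.append((doc_id, score))
        results[query_id] = matches
    return results
```

In this sketch the threshold is a query-time argument, so the same index can be reused when the threshold changes. Whether the paper achieves threshold variation this way is not stated in the abstract.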
Pages: 412 / +
Number of pages: 2
Related papers
50 in total
  • [21] An Information Intelligent Search Method for Computer Forensics Based on Text Similarity
    Yang, Zhongxin
    Chen, Zhifeng
    Zhang, Ping
    Liu, Ming
    Li, Qingbao
    2020 4TH INTERNATIONAL CONFERENCE ON CRYPTOGRAPHY, SECURITY AND PRIVACY (ICCSP 2020), 2020, : 79 - 83
  • [22] Text Categorization via Similarity Search: An Efficient and Effective Novel Algorithm
    Duan, Hubert Haoyang
    Pestov, Vladimir G.
    Singla, Varun
    SIMILARITY SEARCH AND APPLICATIONS (SISAP), 2013, 8199 : 182 - 193
  • [23] Multidimensional Similarity Join Using MapReduce
    Li, Ye
    Wang, Jian
    Hou, Leong U.
    WEB-AGE INFORMATION MANAGEMENT, PT II, 2016, 9659 : 457 - 468
  • [24] Metric Similarity Joins Using MapReduce
    Chen, Gang
    Yang, Keyu
    Chen, Lu
    Gao, Yunjun
    Zheng, Baihua
    Chen, Chun
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (03) : 656 - 669
  • [25] Towards Generalizable Semantic Product Search by Text Similarity Pre-training on Search Click Logs
    Liu, Zheng
    Zhang, Wei
    Chen, Yan
    Sun, Weiyi
    Du, Michael
    Schroeder, Benjamin
    PROCEEDINGS OF THE 5TH WORKSHOP ON E-COMMERCE AND NLP (ECNLP 5), 2022, : 224 - 233
  • [26] Parallel Text Clustering Based on MapReduce
    Cao Zewen
    Zhou Yao
    SECOND INTERNATIONAL CONFERENCE ON CLOUD AND GREEN COMPUTING / SECOND INTERNATIONAL CONFERENCE ON SOCIAL COMPUTING AND ITS APPLICATIONS (CGC/SCA 2012), 2012, : 226 - 229
  • [27] Image search optimization with web scraping, text processing and cosine similarity algorithms
    Ridwang
    Ilham, Amil Ahmad
    Nurtanio, Ingrid
    Syafaruddin
    2020 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATION, NETWORKS AND SATELLITE (COMNETSAT), 2020, : 346 - 350
  • [28] Efficient Batch Processing of Proximity Queries with MapReduce
    Nam, GiWoong
    Kim, DongEun
    Lee, JongHyeok
    Youn, Hee Yong
    Kim, Ung-Mo
    ACM IMCOM 2015, Proceedings, 2015,
  • [29] Efficient EMD-Based Similarity Search via Batch Pruning and Incremental Computation
    Chen, Yu
    Zhang, Yong
    Wang, Jin
    Wu, Jiacheng
    Xing, Chunxiao
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (02) : 1446 - 1459
  • [30] Set Similarity Joins on MapReduce: An Experimental Survey
    Fier, Fabian
    Augsten, Nikolaus
    Bouros, Panagiotis
    Leser, Ulf
    Freytag, Johann-Christoph
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2018, 11 (10) : 1110 - 1122