Enhancing Target Document Search in CopyCatch: A Focus on Thai and English document

Authors
Klaithin, Supon [1]
Kasuriya, Sawit [1]
Affiliations
[1] Natl Sci & Technol Dev Agcy NSTDA, Natl Elect & Comp Technol Ctr NECTEC, Khlong Luang, Thailand
Keywords
plagiarism; document search; similarity; target document; document corpus
DOI
10.1109/iSAI-NLP60301.2023.10354947
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
CopyCatch is automated software designed to detect plagiarism in electronic documents written in Thai and English. It consists of two main components: Target Document Search, and Document Comparison and Similarity Calculation. Target Document Search is the step that identifies, from the document corpus, a list of target documents expected to have content similar to the document being examined. In this paper, we propose a novel approach to building the document corpus and comparing input documents with target documents by focusing on Thai vowels and English letters. The comparison process involves controlling three variables: the passage size, the window size and overlap of 3-gram chunks, and the frequency-matching condition for 3-gram chunks in the document corpus. We found that setting the passage size to 10,000 characters, using a window size and overlap of 10:5 for 3-gram chunks, and applying a frequency-matching condition of 1 had a substantial impact on the accuracy and on the number of files identified as similar.
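The abstract does not spell out the algorithm, so the following is only a minimal Python sketch of how the three variables might interact, not the authors' implementation. It assumes that "10:5" means a window of 10 characters in which neighbouring windows share 5 characters, that 3-grams are character 3-grams taken inside each window, and that the frequency-matching condition is a minimum count of shared 3-grams. The function names (`passages`, `windows`, `candidate_targets`, and so on) are hypothetical, and the focus on Thai vowels and English letters is not modelled here.

```python
from collections import Counter

def passages(text, size=10_000):
    """Split text into consecutive passages of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def windows(passage, window=10, overlap=5):
    """Return character windows; neighbouring windows share `overlap` characters.

    Assumption: this is one reading of the "10:5" setting in the abstract.
    Windows at the end of a passage may be shorter than `window`.
    """
    step = window - overlap
    return [passage[s:s + window] for s in range(0, len(passage), step)]

def trigrams(chunk):
    """Set of character 3-grams contained in a chunk."""
    return {chunk[i:i + 3] for i in range(len(chunk) - 2)}

def doc_trigrams(text):
    """All 3-grams of a document, collected passage by passage, window by window."""
    grams = set()
    for p in passages(text):
        for w in windows(p):
            grams |= trigrams(w)
    return grams

def build_index(corpus):
    """Map each 3-gram to the ids of the corpus documents that contain it."""
    index = {}
    for doc_id, text in corpus.items():
        for g in doc_trigrams(text):
            index.setdefault(g, set()).add(doc_id)
    return index

def candidate_targets(query, index, min_matches=1):
    """Ids of documents sharing at least `min_matches` 3-grams with the query."""
    hits = Counter()
    for g in doc_trigrams(query):
        for doc_id in index.get(g, ()):
            hits[doc_id] += 1
    return {d for d, n in hits.items() if n >= min_matches}

corpus = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "zzzzzzzz yyyyyyy",
}
index = build_index(corpus)
print(candidate_targets("quick brown fox", index))
```

Under these assumptions, raising `min_matches` shrinks the candidate list, while a threshold of 1 keeps every corpus document that shares any 3-gram with the input.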
Pages: 6