Enhancing Target Document Search in CopyCatch: A Focus on Thai and English document

Authors
Klaithin, Supon [1]
Kasuriya, Sawit [1]
Affiliations
[1] Natl Sci & Technol Dev Agcy NSTDA, Natl Elect & Comp Technol Ctr NECTEC, Khlong Luang, Thailand
Keywords
plagiarism; document search; similarity; target document; document corpus
DOI
10.1109/iSAI-NLP60301.2023.10354947
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
CopyCatch is automatic software designed to detect plagiarism in electronic documents written in Thai and English. It consists of two main components: Target Document Search, and Document Comparison and Similarity Calculation. Target Document Search is an important step that identifies, from the document corpus, a list of target documents expected to have content similar to the document being examined. In this paper, we propose a novel approach to building the document corpus and comparing input documents with target documents by focusing on Thai vowels and the English alphabet. The comparison process involves controlling three variables: passage size, the window size and overlap of 3-gram chunks, and the frequency matching of 3-gram chunks in the document corpus. We found that setting the passage size to 10,000 characters, using a window size and overlap of 10:5 for 3-gram chunks, and applying a frequency matching condition of 1 had a substantial impact on the accuracy and on the number of files identified as similar.
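The abstract names the tuning parameters but does not define the algorithm, so the following Python sketch is only one possible reading of it, not the authors' implementation. It assumes that "3-gram chunks" are character 3-grams taken inside sliding windows of 10 characters that overlap by 5, that passages are consecutive 10,000-character slices, and that the "frequency matching condition" is a minimum count of matching 3-gram occurrences per corpus document. All function names (`passages`, `windows`, `trigrams`, `build_index`, `search_targets`) are hypothetical, and the Thai-vowel and English-alphabet filtering mentioned in the abstract is not modeled.

```python
from collections import Counter, defaultdict


def passages(text, passage_size=10_000):
    """Split text into consecutive passages of at most passage_size characters."""
    return [text[i:i + passage_size] for i in range(0, len(text), passage_size)]


def windows(passage, window=10, overlap=5):
    """Yield windows of `window` characters; neighbours share `overlap` characters.

    A trailing remainder shorter than one step may be left uncovered.
    """
    step = window - overlap
    for i in range(0, max(len(passage) - window, 0) + 1, step):
        yield passage[i:i + window]


def trigrams(chunk):
    """Return the set of character 3-grams in a chunk (assumed chunk definition)."""
    return {chunk[i:i + 3] for i in range(len(chunk) - 2)}


def build_index(corpus):
    """Map each 3-gram to the set of corpus document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for p in passages(text):
            for w in windows(p):
                for g in trigrams(w):
                    index[g].add(doc_id)
    return index


def search_targets(query, index, min_matches=1):
    """Return ids of corpus documents with at least min_matches 3-gram hits."""
    hits = Counter()
    for p in passages(query):
        for w in windows(p):
            for g in trigrams(w):
                for doc_id in index.get(g, ()):
                    hits[doc_id] += 1
    return sorted(d for d, n in hits.items() if n >= min_matches)


if __name__ == "__main__":
    corpus = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "zzzzzzzzzzzz",
    }
    index = build_index(corpus)
    print(search_targets("quick brown", index))
```

Under these assumptions, raising `min_matches` shortens the candidate list while lowering it admits more documents, which is one way to picture the trade-off between accuracy and the number of files identified as similar.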
Pages: 6
Related papers
50 in total
  • [1] Spell checker for Thai document
    Watcharabutsarakham, Sarin
    [J]. TENCON 2005 - 2005 IEEE REGION 10 CONFERENCE, VOLS 1-5, 2006, : 1919 - 1922
  • [2] Document Visual Similarity Measure For Document Search
    Ahmadullin, Ildus
    Allebach, Jan P.
    Damera-Venkata, Niranjan
    Fan, Jian
    Lee, Seungyon
    Lin, Qian
    Liu, Jerry
    [J]. DOCENG 2011: PROCEEDINGS OF THE 2011 ACM SYMPOSIUM ON DOCUMENT ENGINEERING, 2011, : 139 - 142
  • [3] Backward transliteration for Thai document retrieval
    Kawtrakul, A
    Deemagarn, A
    Thumkanon, C
    Khantonthong, N
    McFetridge, P
    [J]. APCCAS '98 - IEEE ASIA-PACIFIC CONFERENCE ON CIRCUITS AND SYSTEMS: MICROELECTRONICS AND INTEGRATING SYSTEMS, 1998, : 563 - 566
  • [4] InDUS : Incremental Document Understanding System Focus on Document Classification
    d'Andecy, Vincent Poulain
    Ogier, Jean-Marc
    Joseph, Aurelie
    [J]. 2018 13TH IAPR INTERNATIONAL WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS), 2018, : 239 - 244
  • [5] Simple Document-by-Document Search Tool "Fuwatto Search" Using Web API
    Takaku, Masao
    Egusa, Yuka
    [J]. EMERGENCE OF DIGITAL LIBRARIES - RESEARCH AND PRACTICES, 2014, 8839 : 312 - 319
  • [6] Estimating Document Focus Time
    Jatowt, Adam
    Yeung, Ching-Man Au
    Tanaka, Katsumi
    [J]. PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 2273 - 2278
  • [7] Researching in English: Document Study
    Sawyer, Wayne
    [J]. ENGLISH IN AUSTRALIA, 2015, 50 (03) : 67 - 70
  • [8] Multimedia document search on the Web
    Amato, G
    Rabitti, F
    Savino, P
    [J]. COMPUTER NETWORKS AND ISDN SYSTEMS, 1998, 30 (1-7) : 604 - 606
  • [9] Document Difficulty Aspects for Medical Practitioners: Enhancing Information Retrieval in Personalized Search Engines
    Frihat, Sameh
    Beckmann, Catharina Lena
    Hartmann, Eva Maria
    Fuhr, Norbert
    [J]. APPLIED SCIENCES-BASEL, 2023, 13 (19):
  • [10] Chuweb21D: A Deduped English Document Collection for Web Search Tasks
    Chu, Zhumin
    Sakai, Tetsuya
    Ai, Qingyao
    Liu, Yiqun
    [J]. ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL IN THE ASIA PACIFIC REGION, SIGIR-AP 2023, 2023, : 63 - 72