Enhancing Target Document Search in CopyCatch: A Focus on Thai and English document

Cited by: 0

Authors:

Klaithin, Supon [1]

Kasuriya, Sawit [1]

Affiliations:

[1] Natl Sci & Technol Dev Agcy NSTDA, Natl Elect & Comp Technol Ctr NECTEC, Khlong Luang, Thailand

Source:

2023 18TH INTERNATIONAL JOINT SYMPOSIUM ON ARTIFICIAL INTELLIGENCE AND NATURAL LANGUAGE PROCESSING, ISAI-NLP | 2023

Keywords:

plagiarism; document search; similarity; target document; document corpus;

DOI:

10.1109/iSAI-NLP60301.2023.10354947

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

CopyCatch is an automatic software tool designed to detect plagiarism in electronic documents written in Thai and English. It consists of two main components: Target Document Search, and Document Comparison and Similarity Calculation. Target Document Search is an important step that identifies, from the document corpus, a list of target documents expected to have content similar to the document being examined. In this paper, we propose a novel approach to building the document corpus and comparing input documents with target documents by focusing on Thai vowels and English letters. The comparison process involves controlling three variables: passage size, the window size and overlap of 3-gram chunks, and the frequency matching of 3-gram chunks in the document corpus. We found that setting the passage size to 10,000 characters, using a window size and overlap of 10:5 for 3-gram chunks, and applying a frequency matching condition of 1 had a substantial impact on the accuracy and the number of files identified as similar.
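The abstract names the tuning parameters but not the exact procedure, so the following Python sketch is only one possible reading, not the authors' implementation. It assumes that a document is cut into passages of at most 10,000 characters, that each passage is scanned with a 10-character window whose neighbours share 5 characters, that each window is broken into character 3-grams, and that a corpus document becomes a target candidate when at least `min_matches` distinct 3-grams are shared with the input. All function names and the meaning given to the "frequency matching condition" are assumptions made for illustration; the Thai-vowel and English-letter filtering described in the paper is not modeled here.

```python
from collections import Counter


def split_passages(text, passage_size=10_000):
    """Cut text into consecutive passages of at most passage_size characters."""
    return [text[i:i + passage_size] for i in range(0, len(text), passage_size)]


def windows(passage, window_size=10, overlap=5):
    """Slide a window over a passage; neighbouring windows share `overlap` characters."""
    step = window_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window_size")
    last_start = max(len(passage) - window_size, 0)
    return [passage[i:i + window_size] for i in range(0, last_start + 1, step)]


def trigrams(chunk):
    """Return the character 3-grams of a chunk, in order."""
    return [chunk[i:i + 3] for i in range(len(chunk) - 2)]


def trigram_counts(text, passage_size=10_000, window_size=10, overlap=5):
    """Count the 3-grams seen across all windows of all passages."""
    counts = Counter()
    for passage in split_passages(text, passage_size):
        for window in windows(passage, window_size, overlap):
            counts.update(trigrams(window))
    return counts


def candidate_documents(query, corpus, min_matches=1):
    """Map each corpus document sharing >= min_matches distinct 3-grams to that count.

    `min_matches` is an assumed stand-in for the abstract's
    "frequency matching condition"; the paper may define it differently.
    """
    query_grams = set(trigram_counts(query))
    hits = {}
    for name, text in corpus.items():
        shared = query_grams & set(trigram_counts(text))
        if len(shared) >= min_matches:
            hits[name] = len(shared)
    return hits
```

Calling `candidate_documents(input_text, {"doc_a": text_a, "doc_b": text_b})` returns only the corpus entries that meet the threshold, which mirrors the role of Target Document Search as a filter applied before the more expensive document comparison step.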


Pages: 6
