Document Similarity Detection using K-Means and Cosine Distance

Authors
Usino, Wendi [1]
Prabuwono, Anton Satria [1, 2]
Allehaibi, Khalid Hamed S. [3]
Bramantoro, Arif [1, 2]
Hasniaty, A. [4, 5]
Amaldi, Wahyu [1]
Affiliations
[1] Univ Budi Luhur, Fac Informat Technol, Jakarta, Indonesia
[2] Rabigh King Abdulaziz Univ, Fac Comp & Informat Technol, Rabigh, Saudi Arabia
[3] King Abdulaziz Univ, Fac Comp & Informat Technol, Jeddah, Saudi Arabia
[4] Univ Kebangsaan Malaysia, Inst Visual Informat, Bangi, Selangor, Malaysia
[5] Univ Hasanuddin, Fac Engn, Makassar, Indonesia
Keywords
K-means; cosine distance; cluster; document similarity; document frequency; inverse document frequency; preprocessing; vector space model
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
A two-year study by the Ministry of Research, Technology and Education in Indonesia evaluated most universities in Indonesia. The evaluation found irregularities in the softcopies of various doctoral dissertations, which were similar to texts available on the internet. This suspected plagiarism has a negative effect on both students and faculty members. The main reason behind it is the lack of standardized awareness of plagiarism among faculty members. Therefore, this study proposes a computerized system that detects plagiarism using the K-means and cosine distance algorithms. The process starts with preprocessing, which includes a novel step of checking against the Indonesian big dictionary, followed by vector space model design and the combined calculation of K-means and cosine distance, with 17 documents as test data. Overall, the results show a detection accuracy of 93.33% on these documents.
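The pipeline named in the abstract (preprocessing, a vector space model, then K-means combined with cosine distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the smoothed IDF formula, the fixed initial centroids (`init_indices`), and the four toy sentences are all assumptions made for illustration, and the Indonesian dictionary check is not reproduced.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical preprocessing: lowercase, keep alphabetic tokens only.
    # The dictionary check described in the abstract is NOT reproduced here.
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumption; the paper's exact weighting is not given).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans_cosine(vectors, k, init_indices, max_iter=20):
    """K-means that assigns each vector to the centroid with the
    smallest cosine distance. init_indices picks the starting centroids."""
    centroids = [dict(vectors[i]) for i in init_indices]
    labels = [None] * len(vectors)
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return labels


# Toy data (illustrative only, not the paper's 17 test documents).
docs = [
    "plagiarism detection compares document text",
    "document text plagiarism detection system",
    "k means clustering groups vectors",
    "clustering vectors with k means",
]
vecs = tfidf_vectors(docs)
labels = kmeans_cosine(vecs, k=2, init_indices=[0, 2])
print(labels)
```

In this sketch, documents that land in the same cluster and have a small pairwise cosine distance would be the candidates to flag as similar. The distance threshold for that decision is left open here because the abstract does not state one.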
Pages: 165-170
Page count: 6