Document Similarity Detection using K-Means and Cosine Distance

Authors
Usino, Wendi [1]
Prabuwono, Anton Satria [1, 2]
Allehaibi, Khalid Hamed S. [3]
Bramantoro, Arif [1, 2]
Hasniaty, A. [4, 5]
Amaldi, Wahyu [1]
Affiliations
[1] Univ Budi Luhur, Fac Informat Technol, Jakarta, Indonesia
[2] Rabigh King Abdulaziz Univ, Fac Comp & Informat Technol, Rabigh, Saudi Arabia
[3] King Abdulaziz Univ, Fac Comp & Informat Technol, Jeddah, Saudi Arabia
[4] Univ Kebangsaan Malaysia, Inst Visual Informat, Bangi, Selangor, Malaysia
[5] Univ Hasanuddin, Fac Engn, Makassar, Indonesia
Keywords
K-means; cosine distance; cluster; document similarity; document frequency; inverse document frequency; preprocessing; vector space model
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
A two-year study by the Ministry of Research, Technology and Education in Indonesia evaluated most universities in the country. The evaluation found irregularities in the softcopies of several doctoral dissertations, which were similar to texts available on the internet. This suspected plagiarism harms both students and faculty members. The main reason behind it is the lack of a standardized awareness of plagiarism among faculty members. This study therefore proposes a computerized system that detects plagiarism using the K-means and cosine distance algorithms. The process starts with preprocessing, which includes a novel step of checking words against the Indonesian big dictionary, followed by vector space model design and the combined calculation of K-means and cosine distance, with 17 documents as test data. Overall, the system achieved a detection accuracy of 93.33% on these documents.
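The pipeline named in the abstract (vector space model, cosine distance, K-means) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy documents, the smoothed IDF formula, the fixed initial centroids, and all function names (`tfidf_vectors`, `cosine_distance`, `kmeans`) are assumptions made for illustration, and the Indonesian dictionary check from the preprocessing step is omitted.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumption: smoothed IDF, so that no term receives a zero weight.
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def mean_vector(members):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in members:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(members) for t, w in total.items()}


def kmeans(vectors, init_indices, max_iter=20):
    """K-means that assigns each vector to the centroid with the smallest
    cosine distance. init_indices picks the starting centroids (hypothetical
    deterministic initialization, used here to keep the example repeatable)."""
    centroids = [dict(vectors[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: cosine_distance(vec, centroids[j]))
            for vec in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_vector(members)
    return labels


# Toy, already-preprocessed documents (hypothetical data).
docs = [
    ["plagiarism", "detection", "text", "similarity"],
    ["plagiarism", "detection", "text", "copy"],
    ["football", "match", "goal", "score"],
    ["football", "match", "goal", "team"],
]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, init_indices=[0, 2])
print(labels)
print(cosine_distance(vectors[0], vectors[1]))
```

In this sketch, documents that land in the same cluster and have a small pairwise cosine distance would be the candidates for closer inspection. How the original system combines the two measures and where it sets its decision threshold is not stated in the abstract.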
Pages: 165 - 170
Page count: 6
Related papers
  • [1] Document similarity detection using K-Means and cosine distance
    Usino W.
    Prabuwono A.S.
    Allehaibi K.H.S.
    Bramantoro A.
    Hasniaty A.
    Amaldi W.
    Intl. J. Adv. Comput. Sci. Appl., 2: 165 - 170
  • [2] An Improved K-means Algorithm Using Modified Cosine Distance Measure for Document Clustering Using Mahout with Hadoop
    Sahu, Lokesh
    Mohan, Biju R.
    2014 9TH INTERNATIONAL CONFERENCE ON INDUSTRIAL AND INFORMATION SYSTEMS (ICIIS), 2014, : 1048 - 1052
  • [3] Improving Efficiency of Similarity of Document Network Using Bisect K-Means
    Kadam, Pradnya
    Mate, G. S.
    2017 INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION, CONTROL AND AUTOMATION (ICCUBEA), 2017,
  • [4] College Suggesta: Enhancing Choices with K-Means, KNN, and Cosine Similarity Using Flutter and Django
    Veeraiah, Dasari Chinna
    Anuradha, Ch.
    Nithin, Palli Sai
    Kumari, Peruri Aruna
    IEEE International Conference on Signal Processing and Advance Research in Computing, SPARC 2024, 2024,
  • [5] Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints
    Buatoom, Uraiwan
    Kongprawechnon, Waree
    Theeramunkong, Thanaruk
    SYMMETRY-BASEL, 2020, 12 (06):
  • [6] An Empirical Evaluation of K-Means Clustering Algorithm Using Different Distance/Similarity Metrics
    Gupta, Manoj Kumar
    Chandra, Pravin
    PROCEEDINGS OF ICETIT 2019: EMERGING TRENDS IN INFORMATION TECHNOLOGY, 2020, 605 : 884 - 892
  • [7] Anomaly Detection by Using Streaming K-Means and Batch K-Means
    Wang, Zhuo
    Zhou, Yanghui
    Li, Gangmin
    2020 5TH IEEE INTERNATIONAL CONFERENCE ON BIG DATA ANALYTICS (IEEE ICBDA 2020), 2020, : 11 - 17
  • [8] A modified version of the K-means algorithm based on the shape similarity distance
    Li, Dan
    Li, Xinbao
    FRONTIERS OF MECHANICAL ENGINEERING AND MATERIALS ENGINEERING II, PTS 1 AND 2, 2014, 457-458 : 1064 - 1068
  • [9] Improved Document Clustering using K-means Algorithm
    Bide, Pramod
    Shedge, Rajashree
    2015 IEEE INTERNATIONAL CONFERENCE ON ELECTRICAL, COMPUTER AND COMMUNICATION TECHNOLOGIES, 2015,
  • [10] Global k-means with similarity functions
    López-Escobar, S
    Carrasco-Ochoa, JA
    Martínez-Trinidad, JF
    PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS AND APPLICATIONS, PROCEEDINGS, 2005, 3773 : 392 - 399