Document Similarity Detection using K-Means and Cosine Distance

Cited by: 0
Authors
Usino, Wendi [1]
Prabuwono, Anton Satria [1, 2]
Allehaibi, Khalid Hamed S. [3]
Bramantoro, Arif [1, 2]
Hasniaty, A. [4, 5]
Amaldi, Wahyu [1]
Affiliations
[1] Univ Budi Luhur, Fac Informat Technol, Jakarta, Indonesia
[2] Rabigh King Abdulaziz Univ, Fac Comp & Informat Technol, Rabigh, Saudi Arabia
[3] King Abdulaziz Univ, Fac Comp & Informat Technol, Jeddah, Saudi Arabia
[4] Univ Kebangsaan Malaysia, Inst Visual Informat, Bangi, Selangor, Malaysia
[5] Univ Hasanuddin, Fac Engn, Makassar, Indonesia
Keywords
K-means; cosine distance; cluster; document similarity; document frequency; inverse document frequency; preprocessing; vector space model
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
A two-year study by the Ministry of Research, Technology and Education in Indonesia evaluated most universities in the country. The evaluation found irregularities in the softcopies of various doctoral dissertations, which were similar to texts available on the internet. This suspected plagiarism has a negative effect on both students and faculty members. The main reason behind it is the lack of standardized awareness of plagiarism among faculty members. Therefore, this study proposes a computerized system that detects plagiarism by using the K-means and cosine distance algorithms. The process starts with preprocessing, which includes a novel step of checking words against the Indonesian big dictionary, followed by vector space model design and the combined calculation of K-means and cosine distance on 17 documents as test data. The results show a detection accuracy of 93.33%.
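The pipeline named in the abstract (preprocessing, a vector space model, then K-means combined with cosine distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regex tokenizer stands in for their preprocessing (no Indonesian dictionary check is performed), the TF-IDF weighting `tf * log(N / df)`, the choice of the first `k` documents as initial centroids, and the four English sample sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) from raw strings."""
    # Assumed preprocessing: lowercase and keep alphabetic tokens only.
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise mean of sparse vectors, used as a cluster centroid."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans_cosine(vectors, k, max_iterations=20):
    """K-means that assigns each vector to the centroid at smallest cosine distance."""
    # Assumed deterministic initialization: the first k documents.
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return labels


# Hypothetical sample data, not the 17 test documents from the study.
docs = [
    "plagiarism detection compares student documents",
    "football match ended with a late goal",
    "document plagiarism detection for student theses",
    "the football team scored a goal in the match",
]
vectors = tfidf_vectors(docs)
labels = kmeans_cosine(vectors, k=2)
print(labels)
```

In a setup like this, documents that land in the same cluster and have a small pairwise cosine distance would be the candidates to inspect for similarity; any cutoff for "small" is a tuning choice that the abstract does not specify.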
Pages: 165-170
Number of pages: 6
Related papers
50 in total
  • [21] An Improved K-means Algorithm for Document Clustering
    Wu, Guohua
    Lin, Hairong
    Fu, Ershuai
    Wang, Liuyang
    2015 INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND MECHANICAL AUTOMATION (CSMA), 2015, : 65 - 69
  • [23] Harmony K-means algorithm for document clustering
    Mahdavi, Mehrdad
    Abolhassani, Hassan
    DATA MINING AND KNOWLEDGE DISCOVERY, 2009, 18 (03) : 370 - 391
  • [24] An Efficient Data Structure for Document Clustering Using K-Means Algorithm
    Killani, Ramanji
    Satapathy, Suresh Chandra
    Sowjanya, A. M.
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS DESIGN AND INTELLIGENT APPLICATIONS 2012 (INDIA 2012), 2012, 132 : 337 - +
  • [25] Feature Selection Using Euclidean Distance and Cosine Similarity for Intrusion Detection Model
    Suebsing, Anirut
    Hiransakolwong, Nualsawat
    2009 FIRST ASIAN CONFERENCE ON INTELLIGENT INFORMATION AND DATABASE SYSTEMS, 2009, : 86 - 91
  • [26] K-means algorithm with a novel distance measure
    Abudalfa, Shadi I.
    Mikki, Mohammad
    TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2013, 21 (06) : 1665 - 1684
  • [27] K-Means Clustering with Local Distance Privacy
    Yang, Mengmeng
    Huang, Longxia
    Tang, Chenghua
    BIG DATA MINING AND ANALYTICS, 2023, 6 (04) : 433 - 442
  • [28] Mahalanobis Distance Based K-Means Clustering
    Brown, Paul O.
    Chiang, Meng Ching
    Guo, Shiqing
    Jin, Yingzi
    Leung, Carson K.
    Murray, Evan L.
    Pazdor, Adam G. M.
    Cuzzocrea, Alfredo
    BIG DATA ANALYTICS AND KNOWLEDGE DISCOVERY, DAWAK 2022, 2022, 13428 : 256 - 262
  • [29] On the Geodesic Distance in Shapes K-means Clustering
    Gattone, Stefano Antonio
    De Sanctis, Angela
    Puechmorel, Stephane
    Nicol, Florence
    ENTROPY, 2018, 20 (09)
  • [30] K-Means over Incomplete Datasets Using Mean Euclidean Distance
    AbdAllah, Loai
    Shimshoni, Ilan
    MACHINE LEARNING AND DATA MINING IN PATTERN RECOGNITION (MLDM 2016), 2016, 9729 : 113 - 127