Document Similarity Detection using K-Means and Cosine Distance

Cited by: 0

Authors:

Usino, Wendi ^{[1]}

Prabuwono, Anton Satria ^{[1,2]}

Allehaibi, Khalid Hamed S. ^{[3]}

Bramantoro, Arif ^{[1,2]}

Hasniaty, A. ^{[4,5]}

Amaldi, Wahyu ^{[1]}

Affiliations:

[1] Univ Budi Luhur, Fac Informat Technol, Jakarta, Indonesia

[2] Rabigh King Abdulaziz Univ, Fac Comp & Informat Technol, Rabigh, Saudi Arabia

[3] King Abdulaziz Univ, Fac Comp & Informat Technol, Jeddah, Saudi Arabia

[4] Univ Kebangsaan Malaysia, Inst Visual Informat, Bangi, Selangor, Malaysia

[5] Univ Hasanuddin, Fac Engn, Makassar, Indonesia

Source:

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS | 2019 / Vol. 10 / No. 02

Keywords:

K-means; cosine distance; cluster; document similarity; document frequency; inverse document frequency; preprocessing; vector space model;

DOI:

Not available

Chinese Library Classification number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

A two-year study by the Ministry of Research, Technology and Education in Indonesia evaluated most universities in the country. The evaluation found that the softcopies of various doctoral dissertations were peculiarly similar to texts available on the internet. This suspected plagiarism has a negative effect on both students and faculty members. The main reason behind the behavior is the lack of standardized awareness of plagiarism among faculty members. Therefore, this study proposes a computerized system that detects plagiarism by combining K-means with cosine distance. The process starts with preprocessing, which includes a novel step of checking words against the Indonesian big dictionary, followed by vector space model design and the combined calculation of K-means and cosine distance on 17 test documents. Overall, the system achieved a detection accuracy of 93.33%.
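
The pipeline named in the abstract (preprocessing with a dictionary check, a vector space model, then K-means with cosine distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the four English sentences, the `VOCAB` word set standing in for the dictionary lookup, the TF-IDF weighting `tf * log(N / df)`, and the choice of the first `k` vectors as initial centroids are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical stand-in for a dictionary lookup; the paper's Indonesian
# dictionary check is not reproduced here.
VOCAB = {"plagiarism", "detection", "document", "similarity",
         "football", "team", "goal", "match"}

def preprocess(text):
    """Lowercase, split on whitespace, strip punctuation, keep dictionary words."""
    tokens = (t.strip(".,;:!?()\"'") for t in text.lower().split())
    return [t for t in tokens if t in VOCAB]

def tfidf_vectors(docs):
    """Vector space model: one TF-IDF vector per document (assumed weighting)."""
    tokenized = [preprocess(d) for d in docs]
    terms = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] * math.log(n / df[t]) for t in terms])
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity; all-zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def kmeans(vectors, k, iterations=20):
    """K-means using cosine distance; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by cosine distance.
        labels = [min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy corpus (invented for this sketch, not the paper's 17 test documents).
docs = [
    "Plagiarism detection measures document similarity.",
    "The football team scored a goal in the match.",
    "Document similarity supports plagiarism detection.",
    "A goal decided the football match.",
]
vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

In this toy run the two plagiarism-themed sentences fall into one cluster and the two football sentences into the other, and the cosine distance between the first and third documents is essentially zero because they share the same dictionary words. A real system would need a proper tokenizer, a full dictionary, and a more careful centroid initialization.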


Pages: 165-170

Number of pages: 6

