Document Similarity Detection using K-Means and Cosine Distance

Cited by: 0

Authors:

Usino, Wendi [1]

Prabuwono, Anton Satria [1,2]

Allehaibi, Khalid Hamed S. [3]

Bramantoro, Arif [1,2]

Hasniaty, A. [4,5]

Amaldi, Wahyu [1]

Affiliations:

[1] Univ Budi Luhur, Fac Informat Technol, Jakarta, Indonesia

[2] Rabigh King Abdulaziz Univ, Fac Comp & Informat Technol, Rabigh, Saudi Arabia

[3] King Abdulaziz Univ, Fac Comp & Informat Technol, Jeddah, Saudi Arabia

[4] Univ Kebangsaan Malaysia, Inst Visual Informat, Bangi, Selangor, Malaysia

[5] Univ Hasanuddin, Fac Engn, Makassar, Indonesia

Source:

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS | 2019, Vol. 10, No. 2

Keywords:

K-means; cosine distance; cluster; document similarity; document frequency; inverse document frequency; preprocessing; vector space model

DOI:

Not available

Chinese Library Classification number:

TP301 [Theory and methods]

Discipline classification code:

081202

Abstract:

A two-year study by the Ministry of Research, Technology and Education in Indonesia evaluated most universities in the country. The evaluation found irregularities in the softcopies of many doctoral dissertations, whose text is similar to material available on the internet. This suspected plagiarism has a negative effect on both students and faculty members. The main reason behind it is the lack of a standardized awareness of plagiarism among faculty members. This study therefore proposes a computerized system that detects plagiarism using K-means and cosine distance. The process consists of a preprocessing stage that includes a novel step of checking terms against the Indonesian big dictionary, the design of a vector space model, and the combined calculation of K-means and cosine distance, with 17 documents as test data. The results show a detection accuracy of 93.33%.
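The pipeline named in the abstract (vector space model, cosine distance, K-means) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer is a simple stand-in for their preprocessing, the Indonesian dictionary check is omitted, the smoothed TF-IDF weighting and the seeding of centroids from the first k documents are assumptions, and the four toy documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens (stand-in for the paper's preprocessing)."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Build a vector space model: one TF-IDF vector per document."""
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    n = len(docs)
    df = Counter(t for doc in tokens for t in set(doc))  # document frequency
    # Smoothed IDF (an assumption; the abstract does not state the exact formula).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokens:
        tf = Counter(doc)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)


def kmeans_cosine(vectors, k, iterations=20):
    """K-means that assigns documents by cosine distance.

    The first k vectors seed the centroids (an assumption made so the
    sketch is deterministic).
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy documents, not the 17 test documents from the study.
docs = [
    "plagiarism detection compares document text",
    "football match score goal",
    "plagiarism detection compares document text closely",
    "football team score goal win",
]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans_cosine(vectors, k=2)
print(labels)
print(round(cosine_similarity(vectors[0], vectors[2]), 3))
```

In a setup like this, K-means first groups documents by topic, and the pairwise cosine similarity within a cluster then indicates which documents are close enough to warrant a manual plagiarism check.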


Pages: 165-170

Page count: 6

