IMPROVEMENT OF CLUSTERING ALGORITHMS BY IMPLEMENTATION OF SPELLING BASED RANKING

Cited by: 0

Authors:

Bryer, Evan [1]

Rhujittawiwat, Theppatorn [1]

Rose, John R. [1]

Wilder, Colin F. [2]

Affiliations:

[1] Univ South Carolina, Coll Engn & Comp, Columbia, SC 29208 USA

[2] Univ South Carolina, Ctr Digital Humanities, Columbia, SC 29208 USA

Source:

IADIS-INTERNATIONAL JOURNAL ON COMPUTER SCIENCE AND INFORMATION SYSTEMS | 2021 / Vol. 16 / No. 02

Keywords:

Pre-Processing; Clustering; Cleaning; Data Mining; Spellchecking

DOI:

Not available

Chinese Library Classification:

TP39 [Computer Applications];

Discipline classification codes:

081203 ; 0835 ;

Abstract:

The goal of this paper is to modify an existing clustering algorithm, using the Hunspell spell checker, to specialize it for cleaning early modern European book title data. Duplicate and corrupted data are a constant concern for data analysis, and clustering has been identified as a robust tool for normalizing and cleaning data such as ours. In particular, our data comprises over 5 million books published in European languages between 1500 and 1800, recorded in the Machine-Readable Cataloging (MARC) data format by 17,983 libraries in 123 countries. However, because each library catalogued its records individually, the data set contains many duplicate and inaccurate records. Additionally, each language evolved over the 300-year period we are studying, so the spellings of many words changed. Without cleaning and normalizing this data, it would be difficult to find coherent trends, as a query may miss much of the data. In previous research, we found that Prediction by Partial Matching provided the greatest increase in base accuracy when applied to dirty data similar in construction to our data set. However, there are many cases in which the correct book title is not the most common one, either because only two values exist in a cluster or because the dirty title appears in more records. In these cases, a language-agnostic clustering algorithm would normalize to the incorrect title and lower the overall accuracy of the data set. By integrating the Hunspell spell checker into the clustering algorithm and using it to rank the titles in each cluster by the number of words not found in its dictionary, we can drastically reduce how often this occurs. Indeed, this ranking algorithm increased the overall accuracy of the clustered data by as much as 25% over the unmodified Prediction by Partial Matching algorithm.
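The ranking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `TOY_DICTIONARY`, `is_known`, `count_unknown_words`, `pick_canonical`, and `pick_most_common` are hypothetical, the clustering step is assumed to have already grouped the title variants, and a small in-memory word set stands in for a Hunspell dictionary lookup. Breaking ties by record count is also an assumption.

```python
import re
from collections import Counter
from typing import Callable, Iterable

# Assumption: a toy word list standing in for a Hunspell dictionary.
# A real setup would replace is_known() with a spell-checker lookup.
TOY_DICTIONARY = {"the", "history", "of", "world"}


def is_known(word: str) -> bool:
    """Return True if the word appears in the stand-in dictionary."""
    return word.lower() in TOY_DICTIONARY


def count_unknown_words(title: str, known: Callable[[str], bool] = is_known) -> int:
    """Count the words in a title that the dictionary does not recognize."""
    words = re.findall(r"[^\W\d_]+", title)  # runs of letters only
    return sum(1 for word in words if not known(word))


def pick_canonical(cluster: Iterable[str], known: Callable[[str], bool] = is_known) -> str:
    """Pick the variant with the fewest unknown words.

    Ties fall back to the most frequent variant, then to alphabetical
    order so that the result is deterministic.
    """
    counts = Counter(cluster)
    return min(counts, key=lambda t: (count_unknown_words(t, known), -counts[t], t))


def pick_most_common(cluster: Iterable[str]) -> str:
    """Language-agnostic baseline: pick the most frequent variant."""
    return Counter(cluster).most_common(1)[0][0]


# One cluster in which the misspelled title appears in more records.
cluster = [
    "The Histroy of the World",
    "The Histroy of the World",
    "The History of the World",
]

print(pick_most_common(cluster))  # frequency alone keeps the misspelling
print(pick_canonical(cluster))    # spelling-based ranking prefers the clean title
```

In this example the frequency baseline selects the misspelled variant because it occurs twice, while the spelling-based ranking selects the variant with no unknown words, which is the failure case the abstract describes.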


Pages: 45 - 60

Page count: 16
