Proposal and study of statistical features for string similarity computation and classification

Cited by: 2

Authors:

Rodrigues, E. O. [1]

Casanova, D. [1]

Teixeira, M. [1]

Pegorini, V. [1]

Favarim, F. [1]

Clua, E. [2]

Conci, A. [2]

Liatsis, Panos [3]

Affiliations:

[1] Univ Tecnol Fed Parana UTFPR, Acad Dept Informat, Apucarana, Parana, Brazil

[2] Univ Fed Fluminense UFF, Dept Comp Sci, Rio De Janeiro, Brazil

[3] Khalifa Univ, Dept Elect Engn & Comp Sci, Abu Dhabi, U Arab Emirates

Source:

INTERNATIONAL JOURNAL OF DATA MINING MODELLING AND MANAGEMENT | 2020 / Vol. 12 / No. 03

Keywords:

word comparison; string similarity; classification; statistical features; text mining; optical character recognition; OCR; text plagiarism; text entailment; supervised learning;

DOI:

10.1504/IJDMMM.2020.108731

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Adaptations of two features commonly applied in the field of visual computing, the co-occurrence matrix (COM) and the run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language-related information. They are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are evaluated and compared. In the first, synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features were statistically significantly better than the second-best group, which is based on distances (p-value < 0.001). On a real text plagiarism dataset, the RLM features obtained the best results.
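
As a rough illustration of the kinds of measures the abstract names, the following Python sketch computes two of the standard baselines (longest common subsequence length and Levenshtein edit distance) together with simple character-level co-occurrence counts and run lengths. The function names, the `offset` parameter, and the co-occurrence and run-length definitions are assumptions made here for illustration only; the record does not give the paper's actual COM and RLM feature definitions, which may differ.

```python
from collections import Counter
from itertools import groupby


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (classic dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cooccurrence_counts(s: str, offset: int = 1) -> Counter:
    """Count ordered character pairs (s[i], s[i + offset]).

    Assumption: a generic string analogue of an image co-occurrence matrix,
    not the paper's definition.
    """
    return Counter(zip(s, s[offset:]))


def run_lengths(s: str) -> list:
    """List of (character, run length) pairs for consecutive repeats.

    Assumption: a generic string analogue of run-length statistics,
    not the paper's definition.
    """
    return [(ch, len(list(group))) for ch, group in groupby(s)]
```

For example, `lcs_length("kitten", "sitting")` and `edit_distance("kitten", "sitting")` give two scalar similarity signals for a word pair, while `cooccurrence_counts` and `run_lengths` give per-string statistics from which comparison features could be derived.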


Pages: 277-307

Page count: 31
