Proposal and study of statistical features for string similarity computation and classification

Cited by: 2
Authors
Rodrigues, E. O. [1]
Casanova, D. [1]
Teixeira, M. [1]
Pegorini, V. [1]
Favarim, F. [1]
Clua, E. [2]
Conci, A. [2]
Liatsis, Panos [3]
Affiliations
[1] Univ Tecnol Fed Parana UTFPR, Acad Dept Informat, Apucarana, Parana, Brazil
[2] Univ Fed Fluminense UFF, Dept Comp Sci, Rio De Janeiro, Brazil
[3] Khalifa Univ, Dept Elect Engn & Comp Sci, Abu Dhabi, U Arab Emirates
Keywords
word comparison; string similarity; classification; statistical features; text mining; optical character recognition; OCR; text plagiarism; text entailment; supervised learning;
DOI
10.1504/IJDMMM.2020.108731
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Adaptations of two features commonly applied in visual computing, the co-occurrence matrix (COM) and the run-length matrix (RLM), are proposed for computing the similarity of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language-related information: they are purely statistical and can be used in any context, with any language or grammatical structure. Other statistical measures commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are also evaluated and compared. In the first, synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features performed significantly better than the second-best group, which is based on distances (P-value < 0.001). On a real text plagiarism dataset, the RLM features obtained the best results.
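As a rough illustration of the kinds of measures the abstract names, the following Python sketch counts character co-occurrences and character runs in a string, and computes two of the baseline measures (longest common subsequence length and Levenshtein edit distance). This record does not give the paper's actual feature definitions, so the function names, the `offset` parameter and the choice of raw counts are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def cooccurrence_counts(s, offset=1):
    """Count ordered character pairs (s[i], s[i + offset]).

    Assumption: a string analogue of an image co-occurrence matrix,
    stored sparsely as a Counter; offset is expected to be >= 1.
    """
    return Counter(zip(s, s[offset:]))


def run_length_counts(s):
    """Count (character, run length) pairs over maximal runs of equal characters.

    Assumption: a string analogue of an image run-length matrix.
    """
    runs = Counter()
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs[(s[i], j - i)] += 1
        i = j
    return runs


def lcs_length(a, b):
    """Length of the longest common subsequence (standard dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
```

Under these assumptions, two strings could be compared by deriving numeric features from their count tables and passing them to a classifier, alongside baseline values such as `lcs_length(a, b)` and `edit_distance(a, b)`; how the paper actually derives and combines its features is not stated in this record.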
Pages: 277-307
Number of pages: 31
Related papers
50 in total
  • [31] Comparative Study of Motion Features for Similarity-Based Modeling and Classification of Unsafe Actions in Construction
    Han, SangUk
    Lee, SangHyun
    Pena-Mora, Feniosky
    JOURNAL OF COMPUTING IN CIVIL ENGINEERING, 2014, 28 (05)
  • [32] Breast cancer classification using statistical features and fuzzy classification of thermograms
    Schaefer, Gerald
    Nakashima, Tomoharu
    Zavisek, Michal
    Yokota, Yasuyuki
    Drastich, Ales
    Ishibuchi, Hisao
    2007 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS, VOLS 1-4, 2007, : 1101 - +
  • [33] Classification of complex networks based on similarity of topological network features
    Attar, Niousha
    Aliakbary, Sadegh
    CHAOS, 2017, 27 (09)
  • [34] Comparison of Ink Classification Capabilities of Classic Hyperspectral Similarity Features
    Devassy, Binu Melit
    George, Sony
    Hardeberg, Jon Y.
    2019 INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION WORKSHOPS (ICDARW), VOL 8, 2019, : 25 - 30
  • [35] A vector classifier for sound similarity classification based on audio features
    Yilmazer, Cengiz
    Yilmazer, Semiha
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2023, 153 (03)
  • [36] Detecting Similarity of Transferring Datasets based on Features of Classification Rules
    Abe, Hidenao
    Tsumoto, Shusaku
    2009 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2009), 2009, : 412 - 415
  • [37] Similarity Measurement and Classification of English Characters Based on Language Features
    Miao, Linna
    Fang, Zhixin
    Zhang, Junping
    MOBILE INFORMATION SYSTEMS, 2022, 2022
  • [38] SIMILARITY-PRESERVING DEEP FEATURES FOR HYPERSPECTRAL IMAGE CLASSIFICATION
    Song, Weiwei
    Fang, Leyuan
    Li, Shutao
    IGARSS 2018 - 2018 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2018, : 3595 - 3598
  • [39] Proposal of a New Similarity Measure Based on Delay Embedding for Time Series Classification
    Chakraborty, Basabi
    Yoshida, Sho
    ADVANCES IN TIME SERIES ANALYSIS AND FORECASTING, 2017, : 271 - 284
  • [40] A Heuristic Video Recommendation Algorithm based on Similarity Computation for Multiple Features Analysis
    Li S.
    Recent Advances in Computer Science and Communications, 2022, 15 (08) : 1017 - 1025