Proposal and study of statistical features for string similarity computation and classification

Cited by: 2
Authors
Rodrigues, E. O. [1]
Casanova, D. [1]
Teixeira, M. [1]
Pegorini, V. [1]
Favarim, F. [1]
Clua, E. [2]
Conci, A. [2]
Liatsis, Panos [3]
Affiliations
[1] Univ Tecnol Fed Parana UTFPR, Acad Dept Informat, Apucarana, Parana, Brazil
[2] Univ Fed Fluminense UFF, Dept Comp Sci, Rio De Janeiro, Brazil
[3] Khalifa Univ, Dept Elect Engn & Comp Sci, Abu Dhabi, U Arab Emirates
Keywords
word comparison; string similarity; classification; statistical features; text mining; optical character recognition; OCR; text plagiarism; text entailment; supervised learning
DOI
10.1504/IJDMMM.2020.108731
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Adaptations of two features commonly applied in visual computing, the co-occurrence matrix (COM) and the run-length matrix (RLM), are proposed for computing the similarity of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language-related information: they are purely statistical and can be used in any context, with any language or grammatical structure. Other statistical measures commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are evaluated and compared. In the first, synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features performed significantly better than the second-best group, which is based on distances (P-value < 0.001). On a real text plagiarism dataset, the RLM features obtained the best results.
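The abstract does not give the exact feature definitions, so the following is only a minimal sketch of how co-occurrence and run-length counts might be carried over from image texture analysis to strings. It assumes that a string COM counts ordered character pairs at a fixed offset and that a string RLM counts runs of identical consecutive characters by length. The function names, the `offset` parameter and the use of cosine similarity to compare two count tables are all illustrative assumptions, not the authors' method.

```python
from collections import Counter
from itertools import groupby
from math import sqrt


def cooccurrence_counts(s, offset=1):
    """Count ordered character pairs (s[i], s[i + offset]).

    Assumption: a string analogue of the image co-occurrence matrix,
    stored sparsely as a Counter. `offset` is expected to be >= 1.
    """
    return Counter(zip(s, s[offset:]))


def run_length_counts(s):
    """Count runs of identical consecutive characters.

    Assumption: a string analogue of the run-length matrix, keyed by
    (character, run_length).
    """
    return Counter((ch, len(list(run))) for ch, run in groupby(s))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count tables (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one string yields no counts
    return dot / (norm_a * norm_b)


def com_similarity(x, y, offset=1):
    """Hypothetical COM-based similarity score between two strings."""
    return cosine_similarity(cooccurrence_counts(x, offset),
                             cooccurrence_counts(y, offset))


def rlm_similarity(x, y):
    """Hypothetical RLM-based similarity score between two strings."""
    return cosine_similarity(run_length_counts(x), run_length_counts(y))
```

Under these assumptions, `com_similarity("hello", "hello")` is 1.0 and two strings with no character pair in common score 0.0. Neither function uses a dictionary, tokenizer or grammar, which is in the spirit of the language-independence the abstract claims; in a supervised setting such counts would more likely be fed to a classifier as features than compared directly.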
Pages: 277-307
Number of pages: 31
Related papers
50 in total
  • [1] A smoothing method for a statistical string similarity
    Takasu, Atsuhiro
    Aihara, Kenro
    Yamada, Taizo
    IRI 2007: PROCEEDINGS OF THE 2007 IEEE INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION, 2007: 624-+
  • [2] Statistical string similarity model for information linkage
    Takasu, Atsuhiro
    Progress in Informatics, 2009, (06): 57-62
  • [3] QSJoin: a new string similarity join method based on Q-sample and statistical features
    Wang, Xiaoxia
    Sun, Decai
    Wu, Bo
    Ji, Puzhao
    INTERNATIONAL JOURNAL OF ARTS AND TECHNOLOGY, 2019, 11 (03): 285-308
  • [4] String similarity algorithms for a ticket classification system
    Pikies, Malgorzata
    Ali, Junade
    2019 6TH INTERNATIONAL CONFERENCE ON CONTROL, DECISION AND INFORMATION TECHNOLOGIES (CODIT 2019), 2019: 36-41
  • [5] Computation of generic features for object classification
    Hall, D
    Crowley, JL
    SCALE SPACE METHODS IN COMPUTER VISION, PROCEEDINGS, 2003, 2695: 744-756
  • [6] A proposal for annotation, semantic similarity and classification of textual documents
    Nauer, Emmanuel
    Napoli, Amedeo
    ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS, PROCEEDINGS, 2006, 4183: 201-212
  • [7] Statistical features and perceived similarity of folk melodies
    Eerola, T
    Järvinen, T
    Louhivuori, J
    Toiviainen, P
    MUSIC PERCEPTION, 2001, 18 (03): 275-296
  • [8] Entity Matching with String Transformation and Similarity-Based Features
    Sakai, Kazunori
    Dong, Yuyang
    Oyamada, Masafumi
    Takeoka, Kunihiro
    Okadome, Takeshi
    SOFTWARE FOUNDATIONS FOR DATA INTEROPERABILITY, SFDI 2021, 2022, 1457: 76-87
  • [9] STATISTICAL GEOMETRICAL FEATURES FOR TEXTURE CLASSIFICATION
    CHEN, YQ
    NIXON, MS
    THOMAS, DW
    PATTERN RECOGNITION, 1995, 28 (04): 537-552
  • [10] Statistical landscape features for texture classification
    Xu, CL
    Chen, YQ
    PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 1, 2004: 676-679