Proposal and study of statistical features for string similarity computation and classification

Cited by: 2
Authors
Rodrigues, E. O. [1]
Casanova, D. [1]
Teixeira, M. [1]
Pegorini, V. [1]
Favarim, F. [1]
Clua, E. [2]
Conci, A. [2]
Liatsis, Panos [3]
Affiliations
[1] Univ Tecnol Fed Parana UTFPR, Acad Dept Informat, Apucarana, Parana, Brazil
[2] Univ Fed Fluminense UFF, Dept Comp Sci, Rio De Janeiro, Brazil
[3] Khalifa Univ, Dept Elect Engn & Comp Sci, Abu Dhabi, U Arab Emirates
Keywords
word comparison; string similarity; classification; statistical features; text mining; optical character recognition; OCR; text plagiarism; text entailment; supervised learning;
DOI
10.1504/IJDMMM.2020.108731
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Adaptations of two features commonly applied in visual computing, the co-occurrence matrix (COM) and the run-length matrix (RLM), are proposed for computing the similarity of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language-related information: they are purely statistical and can be used in any context, with any language or grammatical structure. Other statistical measures commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are also evaluated and compared. In the first, synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features performed significantly better than the second-best group, based on distances (P-value < 0.001). On a real text plagiarism dataset, the RLM features obtained the best results.
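The abstract does not give the exact feature definitions, so the following is only a minimal sketch of how co-occurrence and run-length statistics might be computed over characters, with an edit distance shown as one of the baseline measures. The function names, the `offset` parameter, and the use of cosine similarity to compare count vectors are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def cooccurrence_counts(text, offset=1):
    """Count ordered character pairs (text[i], text[i + offset]).

    Assumed analogue of an image co-occurrence matrix; offset must be >= 1.
    """
    return Counter(zip(text, text[offset:]))


def run_length_counts(text):
    """Count maximal runs of one repeated character as (char, length) pairs.

    Assumed analogue of an image run-length matrix.
    """
    runs = Counter()
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        runs[(text[i], j - i)] += 1
        i = j
    return runs


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def levenshtein(s, t):
    """Classic edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]
```

Under these assumptions, two strings could be compared with `cosine_similarity(cooccurrence_counts(a), cooccurrence_counts(b))`, and the resulting scores fed to a supervised classifier alongside distance-based features such as `levenshtein(a, b)`.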
Pages: 277-307
Number of pages: 31