A Text Similarity Measure Based on Suffix Tree

Authors
Huang, Chenghui [1, 2]
Liu, Yan [3]
Xia, Shengzhong [4]
Yin, Jian [1]
Affiliations
[1] Sun Yat Sen Univ, Dept Comp Sci, Guangzhou 510275, Guangdong, Peoples R China
[2] Guangdong Univ Finance, Dept Comp Sci & Technol, Guangzhou 510520, Guangdong, Peoples R China
[3] Guangdong Univ Finance, Dept Appl Math, Guangzhou 510520, Guangdong, Peoples R China
[4] Guangdong AIB Coll, Guangzhou 510507, Guangdong, Peoples R China
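Funding
National Natural Science Foundation of China
Keywords
Similarity measure; Suffix tree; Document model; Text clustering
DOI
Not available
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract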
Most text clustering algorithms use the bag-of-words model, which represents a document as a vector. These methods ignore word-sequence information, and their good clustering results are limited to certain domains. This paper presents a new text similarity measure (STSM) that applies TF-IDF weighting to the word sequences of a document modeled as a suffix tree. Experimental results on the standard benchmark corpora Reuters and BBC show that the new similarity measure is effective. Compared with the state-of-the-art similarity measure, the proposed method improves the average F-measure score by about 10%.
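The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' STSM algorithm, whose details this record does not give. It assumes that word sequences are the root paths of a depth-limited, word-level suffix trie (equivalently, all word n-grams up to `max_len`), that the IDF term is the smoothed form `log(1 + N/df)`, and that weighted documents are compared by cosine similarity. The names `word_sequences`, `tfidf_vectors`, `cosine`, and `max_len` are hypothetical.

```python
import math
from collections import Counter


def word_sequences(text, max_len=3):
    """List every word sequence of length 1..max_len in the text.

    These sequences label the paths from the root of a word-level
    suffix trie truncated at depth max_len (an assumption of this
    sketch, not a detail taken from the paper).
    """
    words = text.lower().split()
    seqs = []
    for start in range(len(words)):  # one suffix per start position
        stop = min(start + max_len, len(words))
        for end in range(start + 1, stop + 1):
            seqs.append(tuple(words[start:end]))
    return seqs


def tfidf_vectors(docs, max_len=3):
    """Weight each document's word sequences by TF-IDF."""
    counts = [Counter(word_sequences(d, max_len)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each sequence counted once per document
    # Assumed smoothed IDF: log(1 + N/df), which is always positive.
    return [
        {s: tf * math.log(1 + n_docs / doc_freq[s]) for s, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * v[s] for s, w in u.items() if s in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # documents sharing long word sequences
sim_far = cosine(vecs[0], vecs[2])    # documents sharing no words
```

Unlike a bag-of-words vector, the shared sequence "the cat sat" contributes to the similarity in addition to its individual words, which is how word order enters the comparison in this sketch.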
Pages: 583-592
Page count: 10