A Text Similarity Measure Based on Suffix Tree

Cited by: 0
Authors
Huang, Chenghui [1, 2]
Liu, Yan [3 ]
Xia, Shengzhong [4 ]
Yin, Jian [1 ]
Affiliations
[1] Sun Yat Sen Univ, Dept Comp Sci, Guangzhou 510275, Guangdong, Peoples R China
[2] Guangdong Univ Finance, Dept Comp Sci & Technol, Guangzhou 510520, Guangdong, Peoples R China
[3] Guangdong Univ Finance, Dept Appl Math, Guangzhou 510520, Guangdong, Peoples R China
[4] Guangdong AIB Coll, Guangzhou 510507, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Similarity measure; Suffix tree; Document model; Text clustering;
DOI
Not available
Chinese Library Classification number
T [Industrial Technology];
Discipline classification code
08;
Abstract
Most text clustering algorithms use the bag-of-words model, which represents a document as a vector. These methods ignore word sequence information, and they produce good clustering results only in certain specialized domains. This paper presents a new text similarity algorithm (STSM) that applies the TF-IDF method to weight the word sequences of a document modeled as a suffix tree. Experimental results on the standard document benchmark corpora Reuters and BBC show that the new text similarity measure is effective. Compared with a state-of-the-art similarity measure, the proposed method improves the average F-measure score by about 10%.
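The general idea of weighting word sequences with TF-IDF can be illustrated with a minimal sketch. This is not the authors' STSM: the record does not give its formulas, so everything below is an assumption made for illustration. The names `phrases`, `tfidf_vectors`, and `cosine` are hypothetical, the suffix tree is approximated by enumerating every word sequence of up to `max_len` words (each one corresponds to a node of a depth-limited word-level suffix trie), and the TF-IDF variant and cosine comparison are common textbook choices rather than the paper's.

```python
import math
from collections import Counter


def phrases(text, max_len=3):
    """Count every word sequence of 1..max_len words in the text.

    Assumption: this stands in for a suffix tree. Each counted sequence
    is a path from the root of a depth-limited word-level suffix trie.
    """
    words = text.lower().split()
    counts = Counter()
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length <= len(words):
                counts[tuple(words[start:start + length])] += 1
    return counts


def tfidf_vectors(docs, max_len=3):
    """Build one sparse TF-IDF vector (a dict) per document.

    Assumption: weight = (1 + ln tf) * ln(1 + N / df), a smoothed
    variant chosen so that every weight stays positive.
    """
    counts = [phrases(doc, max_len) for doc in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each phrase counted once per document
    n_docs = len(docs)
    return [
        {p: (1 + math.log(tf)) * math.log(1 + n_docs / doc_freq[p])
         for p, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = ["cat ate cheese", "cheese ate cat"]
bag_of_words = tfidf_vectors(docs, max_len=1)    # single words only
with_sequences = tfidf_vectors(docs, max_len=2)  # words and word pairs
print(cosine(bag_of_words[0], bag_of_words[1]))
print(cosine(with_sequences[0], with_sequences[1]))
```

The two example documents contain the same words in a different order. With `max_len=1` the vectors are identical, which is the bag-of-words limitation the abstract describes; with `max_len=2` the differing word pairs lower the similarity.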
Pages: 583-592
Page count: 10
Related papers
50 in total
  • [1] An Improved Text Retrieval Algorithm Based on Suffix Tree Similarity Measure
    Huang, Cheng-hui
    Yin, Jian
    Han, Dong
    INFORMATION COMPUTING AND APPLICATIONS, PT 2, 2010, 106 : 150 - +
  • [2] Text clustering using a suffix tree similarity measure
    Huang C.
    Yin J.
    Hou F.
    Journal of Computers, 2011, 6 (10) : 2180 - 2186
  • [3] Using Annotated Suffix Tree Similarity Measure for Text Summarisation
    Yakovlev, Maxim
    Chernyak, Ekaterina
    ANALYSIS OF LARGE AND COMPLEX DATA, 2016, : 103 - 112
  • [4] A New Suffix Tree Similarity Measure and Labeling for Web Search Results Clustering
    Kale, Archana
    Bharambe, Ujwala
    SashiKumar, M.
    2009 SECOND INTERNATIONAL CONFERENCE ON EMERGING TRENDS IN ENGINEERING AND TECHNOLOGY (ICETET 2009), 2009, : 1148 - +
  • [5] A Suffix Tree Or Not a Suffix Tree?
    Starikovskaya, Tatiana
    Vildhoj, Hjalte Wedel
    COMBINATORIAL ALGORITHMS, IWOCA 2014, 2015, 8986 : 338 - 350
  • [6] A suffix tree or not a suffix tree?
    Starikovskaya, Tatiana
    Vildhoj, Hjalte Wedel
    JOURNAL OF DISCRETE ALGORITHMS, 2015, 32 : 14 - 23
  • [7] TextFlow: A Text Similarity Measure based on Continuous Sequences
    Mrabet, Yassine
    Kilicoglu, Halil
    Demner-Fushman, Dina
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1, 2017, : 763 - 772
  • [8] A Short Text Similarity Measure Based on Hidden Topics
    Chen, Hong-chao
    Guo, Xiao-hua
    Liu, Ling-qiang
    Zhu, Xin-hua
    COMPUTER SCIENCE AND TECHNOLOGY (CST2016), 2017, : 1101 - 1108
  • [9] Text generation by probabilistic suffix tree language model
    Marukatat, Sanparith
    16TH INTERNATIONAL JOINT SYMPOSIUM ON ARTIFICIAL INTELLIGENCE AND NATURAL LANGUAGE PROCESSING (ISAI-NLP 2021), 2021,
  • [10] Semantic Similarity Measure Based on Ontology Hierarchical Tree
    Ge, Jike
    Qiu, Yuhui
    Yin, Shiqun
    Chen, Zuqin
    2008 4TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-31, 2008, : 5290 - 5294