A Text Similarity Measure Based on Suffix Tree

Cited by: 0
Authors
Huang, Chenghui [1, 2]
Liu, Yan [3 ]
Xia, Shengzhong [4 ]
Yin, Jian [1 ]
Affiliations
[1] Sun Yat Sen Univ, Dept Comp Sci, Guangzhou 510275, Guangdong, Peoples R China
[2] Guangdong Univ Finance, Dept Comp Sci & Technol, Guangzhou 510520, Guangdong, Peoples R China
[3] Guangdong Univ Finance, Dept Appl Math, Guangzhou 510520, Guangdong, Peoples R China
[4] Guangdong AIB Coll, Guangzhou 510507, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Similarity measure; Suffix tree; Document model; Text clustering
DOI
Not available
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
Most text clustering algorithms use the bag-of-words model, which represents a document as a vector. These methods ignore word sequence information, and they achieve good clustering results only in certain domains. This paper presents a new text similarity algorithm (STSM) that applies the TF-IDF method to weight the word sequences of a document modeled as a suffix tree. Experimental results on the standard document benchmark corpora Reuters and BBC show that the new text similarity measure is effective. Compared with a state-of-the-art similarity measure, the proposed method improves the average F-measure score by about 10%.
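The abstract gives no implementation details, so the following is only a rough sketch of the general idea it describes, not the authors' STSM. It assumes that the word sequences are the word n-grams starting at each suffix of a document (the labels a depth-limited word-level suffix tree would hold), that weights use a sublinear TF and a smoothed IDF, and that similarity is the cosine of the weighted vectors. The function names and the `max_len` parameter are hypothetical.

```python
import math
from collections import Counter


def word_suffix_phrases(text, max_len=3):
    """List the word sequences (1..max_len words) that begin each suffix of the text.

    Assumption: these stand in for the node labels of a word-level suffix
    tree truncated at depth max_len; no explicit tree is built here.
    """
    words = text.lower().split()
    phrases = []
    for start in range(len(words)):
        last = min(start + max_len, len(words))
        for end in range(start + 1, last + 1):
            phrases.append(" ".join(words[start:end]))
    return phrases


def tfidf_vectors(docs, max_len=3):
    """Weight every phrase of every document with a TF-IDF variant.

    Assumption: tf = 1 + log(count), idf = log(1 + N / df). The smoothing
    keeps phrases that occur in all documents from getting zero weight.
    """
    counts = [Counter(word_suffix_phrases(d, max_len)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each phrase counted once per document
    return [
        {p: (1 + math.log(tf)) * math.log(1 + n_docs / doc_freq[p])
         for p, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse phrase-weight dictionaries."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because the first two toy documents share several multi-word sequences, they score higher against each other than against the third, which shares no words with them. A bag-of-words vector would capture only the single-word overlap.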
Pages: 583-592
Page count: 10