A Text Similarity Measure Based on Suffix Tree

Cited by: 0

Authors:

Huang, Chenghui [1,2]

Liu, Yan [3]

Xia, Shengzhong [4]

Yin, Jian [1]

Affiliations:

[1] Sun Yat Sen Univ, Dept Comp Sci, Guangzhou 510275, Guangdong, Peoples R China

[2] Guangdong Univ Finance, Dept Comp Sci & Technol, Guangzhou 510520, Guangdong, Peoples R China

[3] Guangdong Univ Finance, Dept Appl Math, Guangzhou 510520, Guangdong, Peoples R China

[4] Guangdong AIB Coll, Guangzhou 510507, Guangdong, Peoples R China

Source:

INFORMATION-AN INTERNATIONAL INTERDISCIPLINARY JOURNAL | 2011, Vol. 14, No. 02

Funding:

National Natural Science Foundation of China;

Keywords:

Similarity measure; Suffix tree; Document model; Text clustering;

DOI:

Not available

Chinese Library Classification number:

T [Industrial Technology];

Discipline classification code:

08;

Abstract:

It is well known that most text clustering algorithms use the bag-of-words model, which represents a document as a vector. These methods ignore word sequence information, and their good clustering results are limited to some special domains. This paper presents a new text similarity algorithm (STSM) that applies the TF-IDF method to weight the word sequences of a document modeled as a suffix tree. Experimental results on the standard document benchmark corpora Reuters and BBC show that the new text similarity measure is effective. Compared with the results of the state-of-the-art similarity measure, the proposed method improves the average F-measure score by about 10%.
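
The idea in the abstract, weighting word sequences rather than single words, can be illustrated with a minimal Python sketch. This is not the paper's STSM definition, which the abstract does not give. It assumes that the word sequences are the prefixes of each word-level suffix (the root-to-node path labels of a suffix tree), truncated to a hypothetical `max_len` words, that the weight is the textbook `tf * log(N / df)`, and that similarity is the cosine of the weighted vectors. All function names are illustrative.

```python
import math
from collections import Counter


def phrase_counts(text, max_len=3):
    """Count every word sequence of up to max_len words.

    These sequences are the path labels from the root of a word-level
    suffix tree truncated at depth max_len (an assumed parameter).
    """
    words = text.lower().split()
    counts = Counter()
    for start in range(len(words)):  # one suffix per start position
        stop = min(start + max_len, len(words))
        for end in range(start + 1, stop + 1):
            counts[tuple(words[start:end])] += 1
    return counts


def tfidf_vectors(docs, max_len=3):
    """Return one sparse phrase -> weight dict per document."""
    per_doc = [phrase_counts(d, max_len) for d in docs]
    df = Counter()  # number of documents containing each phrase
    for counts in per_doc:
        df.update(counts.keys())
    n = len(docs)
    return [
        {p: tf * math.log(n / df[p]) for p, tf in counts.items()}
        for counts in per_doc
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zero."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["cat ate cheese", "mouse ate cheese too", "dog chased cat"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In this toy corpus the first two documents share the sequence "ate cheese" as well as its individual words, so they score higher than the first and third, which share only the word "cat". A bag-of-words model would not credit the shared two-word sequence.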

Pages: 583-592

Page count: 10