A Text Similarity Measure Based on Suffix Tree

Cited by: 0

Authors:

Huang, Chenghui [1,2]

Liu, Yan [3]

Xia, Shengzhong [4]

Yin, Jian [1]

Affiliations:

[1] Sun Yat Sen Univ, Dept Comp Sci, Guangzhou 510275, Guangdong, Peoples R China

[2] Guangdong Univ Finance, Dept Comp Sci & Technol, Guangzhou 510520, Guangdong, Peoples R China

[3] Guangdong Univ Finance, Dept Appl Math, Guangzhou 510520, Guangdong, Peoples R China

[4] Guangdong AIB Coll, Guangzhou 510507, Guangdong, Peoples R China

Source:

INFORMATION-AN INTERNATIONAL INTERDISCIPLINARY JOURNAL | 2011 / Vol. 14 / No. 02

Funding:

National Natural Science Foundation of China;

Keywords:

Similarity measure; Suffix tree; Document model; Text clustering;

DOI:

Not available

Chinese Library Classification (CLC) number:

T [Industrial Technology];

Discipline classification code:

08 ;

Abstract:

Most text clustering algorithms use the bag-of-words model, which represents a document as a vector. These methods ignore word sequence information, and their good clustering results are limited to a few special domains. This paper presents a new text similarity algorithm (STSM) that applies the TF-IDF method to weight the word sequences of a document modeled as a suffix tree. Experimental results on the standard document benchmark corpora Reuters and BBC show that the new text similarity measure is effective. Compared with the state-of-the-art similarity measure, the proposed method improves the average F-measure score by about 10%.
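
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' STSM: the abstract does not give the exact weighting formula, so the code below makes several assumptions. It enumerates word sequences of up to `max_len` words (the labels a depth-limited word-level suffix trie would hold) in a plain dictionary instead of building a real suffix tree, uses a smoothed idf, and compares documents by cosine similarity. The names `phrases`, `tfidf_vectors`, `cosine`, and `max_len` are hypothetical.

```python
import math
from collections import Counter


def phrases(text, max_len=3):
    """Return all contiguous word sequences of up to max_len words.

    These are the prefixes of every word-level suffix, i.e. the labels a
    word-level suffix trie of depth max_len would store (assumption: a
    dictionary of phrases stands in for the suffix tree here).
    """
    words = text.lower().split()
    return [
        tuple(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, min(i + max_len, len(words)) + 1)
    ]


def tfidf_vectors(docs, max_len=3):
    """Build one sparse TF-IDF vector (a dict) per document over phrases."""
    counts = [Counter(phrases(d, max_len)) for d in docs]
    # Document frequency: in how many documents each phrase occurs.
    df = Counter(p for c in counts for p in c)
    n = len(docs)
    # Smoothed idf (an assumption) so shared phrases keep a nonzero weight.
    return [
        {p: tf * math.log(1 + n / df[p]) for p, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # documents sharing long word sequences
sim_far = cosine(vecs[0], vecs[2])    # documents sharing no words
```

Because multi-word sequences enter the vector alongside single words, two documents that share the same words in the same order score higher than two that merely share vocabulary, which is the information a bag-of-words vector discards.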

Pages: 583-592

Page count: 10
