INCREMENTAL CLUSTERING IN SHORT TEXT STREAMS BASED ON BM25

Cited by: 0
Authors
Xu, Lixin [1]
Chen, Guang [1]
Yang, Lei [1]
Affiliation
[1] Beijing Univ Posts & Telecommun, Beijing 100876, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Short text stream; Incremental clustering; BM25; Cluster cohesion; Keyword similarity;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Because short texts contain few keywords and have sparse features, clustering them suffers from the similarity drift problem. Traditional clustering algorithms are usually ineffective and waste resources when applied to short text streams. To overcome these problems, this paper proposes an incremental clustering algorithm for short text streams based on BM25. The approach uses BM25 to extract the keywords of each cluster together with their weights, and applies these extracted parameters to the similarity calculation. Theoretical analysis and experiments show that, compared with traditional clustering algorithms, the proposed incremental algorithm handles the similarity drift problem well and achieves satisfactory accuracy and performance on short text stream clustering.
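The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea it describes: each cluster is treated as one BM25 "document", cluster keywords are weighted with BM25, and each incoming text joins the most similar cluster or starts a new one. The names `Cluster`, `bm25_weight`, `similarity`, `top_keywords`, and `cluster_stream`, the summed-weight similarity, the threshold of 0.5, and the parameter values `K1 = 1.2` and `B = 0.75` are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter

K1 = 1.2   # BM25 term-frequency saturation (commonly used value, assumed here)
B = 0.75   # BM25 length normalisation (commonly used value, assumed here)


class Cluster:
    """Running statistics for one cluster; updated incrementally."""

    def __init__(self):
        self.tf = Counter()   # term -> count over all texts in the cluster
        self.length = 0       # total number of tokens in the cluster

    def add(self, tokens):
        self.tf.update(tokens)
        self.length += len(tokens)


def bm25_weight(term, cluster, clusters):
    """BM25 weight of `term` in `cluster`, treating each cluster as a document."""
    n = len(clusters)
    df = sum(1 for c in clusters if term in c.tf)
    idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
    avg_len = sum(c.length for c in clusters) / n
    f = cluster.tf[term]
    norm = 1 - B + B * cluster.length / avg_len
    return idf * f * (K1 + 1) / (f + K1 * norm)


def similarity(tokens, cluster, clusters):
    """Assumed similarity: sum of BM25 weights of the text's distinct terms."""
    return sum(bm25_weight(t, cluster, clusters) for t in set(tokens))


def top_keywords(cluster, clusters, k=3):
    """The k terms of `cluster` with the highest BM25 weight."""
    weights = {t: bm25_weight(t, cluster, clusters) for t in cluster.tf}
    return sorted(weights, key=weights.get, reverse=True)[:k]


def cluster_stream(texts, threshold=0.5):
    """Assign each text, in arrival order, to the best cluster or a new one."""
    clusters, labels = [], []
    for text in texts:
        tokens = text.lower().split()
        best, best_score = None, 0.0
        for i, c in enumerate(clusters):
            score = similarity(tokens, c, clusters)
            if score > best_score:
                best, best_score = i, score
        if best is None or best_score < threshold:
            clusters.append(Cluster())      # no cluster is similar enough
            best = len(clusters) - 1
        clusters[best].add(tokens)          # incremental update, no re-clustering
        labels.append(best)
    return labels, clusters


stream = ["apple banana fruit", "banana apple juice",
          "python code bug", "code bug fix"]
labels, clusters = cluster_stream(stream)
print(labels)
print(top_keywords(clusters[0], clusters, k=2))
```

Because only per-cluster term counts are stored, each new text costs one pass over the existing clusters rather than a full re-clustering. The fixed threshold is the weakest part of this sketch: BM25 scores shift as clusters are added, so a real system would need to calibrate or normalise it.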
Pages: 8-12
Page count: 5