Improving document clustering using Okapi BM25 feature weighting

Cited by: 32
Authors
Whissell, John S. [1 ]
Clarke, Charles L. A. [1 ]
Affiliations
[1] Univ Waterloo, David R Cheriton Sch Comp Sci, Waterloo, ON N2L 3G1, Canada
Source
INFORMATION RETRIEVAL | 2011 / Vol. 14 / Issue 05
Keywords
Document clustering; Feature weighting; Okapi BM25;
DOI
10.1007/s10791-011-9163-y
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
We investigate the effect of feature weighting on document clustering, including a novel investigation of Okapi BM25 feature weighting. Using eight document datasets and 17 well-established clustering algorithms we show that the benefit of tf-idf weighting over tf weighting is heavily dependent on both the dataset being clustered and the algorithm used. In addition, binary weighting is shown to be consistently inferior to both tf-idf weighting and tf weighting. We investigate clustering using both BM25 term saturation in isolation and BM25 term saturation with idf, confirming that both are superior to their non-BM25 counterparts under several common clustering quality measures. Finally, we investigate estimation of the k1 BM25 parameter when clustering. Our results indicate that typical values of k1 from other IR tasks are not appropriate for clustering; k1 needs to be higher.
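The abstract does not spell out the weighting formulas, so the following is only a minimal sketch of the widely used textbook form of BM25 term saturation, with and without idf. The function names, the length-normalization parameter `b = 0.75`, and the plain logarithmic idf are assumptions made for illustration and are not taken from the paper.

```python
import math


def bm25_saturation(tf, k1, doc_len, avg_doc_len, b=0.75):
    """Textbook BM25 term-frequency component (assumed form, not the paper's).

    The value grows with tf but is bounded above by k1 + 1.
    """
    if tf <= 0:
        return 0.0
    # Length normalization: longer-than-average documents are penalized.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def idf(n_docs, doc_freq):
    """Plain logarithmic idf; an assumed variant chosen for simplicity."""
    return math.log(n_docs / doc_freq)


def bm25_weight(tf, k1, doc_len, avg_doc_len, n_docs, doc_freq, b=0.75):
    """BM25 term saturation combined with idf."""
    return bm25_saturation(tf, k1, doc_len, avg_doc_len, b) * idf(n_docs, doc_freq)


if __name__ == "__main__":
    # Compare how quickly the weight saturates for a small and a large k1.
    for k1 in (1.2, 20.0):
        weights = [bm25_saturation(tf, k1, 100, 100) for tf in (1, 5, 10, 50)]
        print(k1, [round(w, 3) for w in weights])
```

In this form, k1 controls how quickly the weight flattens as the term frequency grows: with a larger k1 the weight keeps increasing over a wider range of tf, which is the parameter the abstract reports needs to be higher for clustering than for other IR tasks.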
Pages: 466 - 487
Number of pages: 22
Related papers
50 in total
  • [1] Improving document clustering using Okapi BM25 feature weighting
    John S. Whissell
    Charles L. A. Clarke
    [J]. Information Retrieval, 2011, 14 : 466 - 487
  • [2] BM25-CTF: Improving TF and IDF factors in BM25 by using collection term frequencies
    Jimenez, Sergio
    Cucerzan, Silviu-Petru
    Gonzalez, Fabio A.
    Gelbukh, Alexander
    Duenas, George
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2018, 34 (05) : 2887 - 2899
  • [3] Improving the Sentiment Analysis Process of Spanish Tweets with BM25
    Sixto, Juan
    Almeida, Aitor
    Lopez-de-Ipina, Diego
    [J]. NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, NLDB 2016, 2016, 9612 : 285 - 291
  • [4] INCREMENTAL CLUSTERING IN SHORT TEXT STREAMS BASED ON BM25
    Xu, Lixin
    Chen, Guang
    Yang, Lei
    [J]. 2014 IEEE 3RD INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (CCIS), 2014, : 8 - 12
  • [5] Duplication Detection for Software Bug Reports based on BM25 Term Weighting
    Yang, Cheng-Zen
    Du, Hung-Hsueh
    Wu, Sin-Sian
    Chen, Ing-Xiang
    [J]. 2012 CONFERENCE ON TECHNOLOGIES AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE (TAAI), 2012, : 33 - 38
  • [6] Document clustering using sample weighting
    Zhang, Chengzhi
    Su, Xinning
    Zhou, Dongmin
    [J]. RECENT ADVANCE OF CHINESE COMPUTING TECHNOLOGIES, 2007, : 260 - 265
  • [7] Learning to Rank for Determining Relevant Document in Indonesian-English Cross Language Information Retrieval using BM25
    Sari, Syandra
    Adriani, Mirna
    [J]. 2014 INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER SCIENCE AND INFORMATION SYSTEMS (ICACSIS), 2014, : 309 - 314
  • [8] Course Recommendation by Improving BM25 to Identity Students' Different Levels of Interests in Courses
    Wang, Xin
    Yuan, Fang
    [J]. 2009 INTERNATIONAL CONFERENCE ON NEW TRENDS IN INFORMATION AND SERVICE SCIENCE (NISS 2009), VOLS 1 AND 2, 2009, : 1372 - 1377
  • [9] Feature weighting for improving document image retrieval system performance
    [J]. Keyvanpour, M., 1600, International Journal of Computer Science Issues (IJCSI) (09): : 3 - 3
  • [10] A COMBINATION WEIGHTING ALGORITHM USING RELATIVE ENTROPY FOR DOCUMENT CLUSTERING
    Ji, Bo
    Ye, Yangdong
    Xiao, Yu
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2014, 28 (03)