A topic-based term frequency normalization framework to enhance probabilistic information retrieval

Cited by: 5
Authors
Jian, Fanghong [1, 2]
Huang, Jimmy X. [3 ]
Zhao, Jiashu [4 ]
Ying, Zhiwei [3 ]
Wang, Yuqi [3 ]
Affiliations
[1] Cent China Normal Univ, Natl Engn Res Ctr ELearning, Wuhan, Hubei, Peoples R China
[2] Jiujiang Univ, Sch Sci, Jiujiang, Peoples R China
[3] York Univ, Sch Informat Technol, Technol Enhanced Learning Bldg, Toronto, ON M3J 1P3, Canada
[4] Wilfrid Laurier Univ, Dept Phys & Comp Sci, Waterloo, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Dirichlet language model; embedding; LDA; probabilistic model; term frequency normalization; topic modeling; language models;
DOI
10.1111/coin.12248
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many well-known probabilistic information retrieval models, BM25 in particular, have shown promise for document ranking. However, the control parameters of BM25 usually need to be tuned to achieve good performance on different data sets; in addition, BM25's bag-of-words assumption prevents it from directly using the rich information that lies at the sentence or document level. Motivated by these challenges, we first propose a new term frequency normalization method for BM25 (called BM25(QL) in this paper); the method is also incorporated into CRTER2, a recent BM25-based model, to construct CRTER2QL. We then incorporate topic modeling and word embedding into BM25 to relax the bag-of-words assumption. In this direction, we propose a topic-based retrieval model, TopTF, for BM25, which is further incorporated into the language model (LM) and the multiple aspect term frequency (MATF) model. Furthermore, we present ETopTF, an enhanced topic-based term frequency normalization framework based on embedding. Experimental studies demonstrate the effectiveness of these methods. Specifically, on all tested data sets and in terms of mean average precision (MAP), the proposed BM25(QL) and CRTER2QL models are comparable to BM25 and CRTER2 with the best b parameter value; the TopTF models significantly outperform the baselines, and the ETopTF models can further improve on TopTF in terms of MAP.
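For readers unfamiliar with the control parameters the abstract refers to, the following is a minimal sketch of the classic BM25 term weight, not of BM25(QL) or any other model proposed in the paper, whose formulas are not given in this record. The function name `bm25_term_score`, the default values `k1=1.2` and `b=0.75`, and the non-negative IDF variant are assumptions chosen for illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score contribution of one query term in one document (classic BM25).

    tf          : term frequency in the document
    doc_len     : document length in tokens
    avg_doc_len : average document length in the collection
    n_docs      : number of documents in the collection
    doc_freq    : number of documents containing the term
    k1, b       : BM25 control parameters (illustrative defaults, an assumption)
    """
    # Document length normalization: b=0 ignores length, b=1 normalizes fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturating term frequency component, dampened for longer documents.
    tf_component = tf * (k1 + 1.0) / (tf + k1 * length_norm)
    # IDF variant with 1 added inside the log so the weight stays non-negative.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    return idf * tf_component


if __name__ == "__main__":
    # Same term frequency, documents of different lengths, several b values.
    for b in (0.0, 0.35, 0.75, 1.0):
        short = bm25_term_score(3, 50, 100, 1000, 10, b=b)
        long_ = bm25_term_score(3, 500, 100, 1000, 10, b=b)
        print(f"b={b:.2f}  short doc: {short:.4f}  long doc: {long_:.4f}")
```

Running the loop shows that the gap between the short and long document grows with b, which is why the best b value tends to differ from one data set to another.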
Pages: 486 - 521
Page count: 36
Related papers
50 in total
  • [1] A Probabilistic Topic-Based Ranking Framework for Location-Sensitive Domain Information Retrieval
    Li, Huajing
    Li, Zhisheng
    Lee, Wang-Chien
    Lee, Dik Lun
    [J]. PROCEEDINGS 32ND ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2009, : 331 - 338
  • [2] A New Term Frequency Normalization Model for Probabilistic Information Retrieval
    Jian, Fanghong
    Huang, Jimmy Xiangji
    Zhao, Jiashu
    He, Tingting
    [J]. ACM/SIGIR PROCEEDINGS 2018, 2018, : 1237 - 1240
  • [3] Rewarding Term Location Information to Enhance Probabilistic Information Retrieval
    Zhao, Jiashu
    Huang, Jimmy Xiangji
    Wu, Shicheng
    [J]. SIGIR 2012: PROCEEDINGS OF THE 35TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2012, : 1137 - 1138
  • [4] Using Term Location Information to Enhance Probabilistic Information Retrieval
    Liu, Baiyan
    An, Xiangdong
    Huang, Jimmy Xiangji
    [J]. SIGIR 2015: PROCEEDINGS OF THE 38TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2015, : 883 - 886
  • [5] A Topic-Based Measure of Resource Description Quality for Distributed Information Retrieval
    Baillie, Mark
    Carman, Mark J.
    Crestani, Fabio
    [J]. ADVANCES IN INFORMATION RETRIEVAL, PROCEEDINGS, 2009, 5478 : 485 - +
  • [6] A Topic-based Document Retrieval System Architecture
    Jia, Xiping
    [J]. 2010 THE 3RD INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND INDUSTRIAL APPLICATION (PACIIA2010), VOL VIII, 2010, : 80 - 83
  • [7] A Topic-based Document Retrieval System Architecture
    Jia, Xiping
    [J]. 2011 INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTATION AND INDUSTRIAL APPLICATION (ICIA2011), VOL III, 2011, : 80 - 83
  • [8] Managing word mismatch problems in information retrieval: A topic-based query expansion approach
    Wei, Chih-Ping
    Hu, Paul Jen-Hwa
    Tai, Chia-Hung
    Huang, Chun-Neng
    Yang, Chin-Sheng
    [J]. JOURNAL OF MANAGEMENT INFORMATION SYSTEMS, 2007, 24 (03) : 269 - 295
  • [9] Topic-based ranking in Folksonomy via probabilistic model
    Jin, Yan'an
    Li, Ruixuan
    Wen, Kunmei
    Gu, Xiwu
    Xiao, Fei
    [J]. ARTIFICIAL INTELLIGENCE REVIEW, 2011, 36 (02) : 139 - 151