Term frequency - function of document frequency: a new term weighting scheme for enterprise information retrieval

Cited by: 14
Authors
Zhang, Hui [1]
Wang, Deqing [1]
Wu, Wenjun [1]
Hu, Hongping [1]
Affiliations
[1] Beihang Univ, Sch Comp Sci, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
enterprise information retrieval; term weighting scheme; term frequency; function of document frequency; relevance ranking; PERFORMANCE;
DOI
10.1080/17517575.2012.665945
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In today's business environment, enterprises are under increasing pressure to process the vast amount of data they produce every day. One approach is to focus on business intelligence (BI) applications and to increase commercial added value through such business analytics activities. Term weighting, which is used to convert documents into vectors in the term space, is a vital task in enterprise Information Retrieval (IR), text categorisation, text analytics, etc. When determining the weight of a term in a document, the traditional TF-IDF scheme considers only the term's occurrence frequency within the document and across the entire document set, so some meaningful terms do not receive an appropriate weight. In this article, we propose a new term weighting scheme called Term Frequency Function of Document Frequency (TF-FDF) to address this issue. Instead of using a monotonically decreasing function such as Inverse Document Frequency (IDF), FDF uses a convex function that dynamically adjusts weights according to the significance of the words in a document set. This function can be manually tuned based on the distribution of the most meaningful words, that is, those which semantically represent the document set. Our experiments show that TF-FDF achieves a higher Normalised Discounted Cumulative Gain (NDCG) in IR than TF-IDF and its variants, and improves the accuracy of relevance ranking of the IR results.
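As a rough illustration of the two ingredients the abstract names, the Python sketch below computes term weights as tf multiplied by a pluggable function of document frequency, and scores a ranking with NDCG. Only the classic IDF is implemented, because the abstract does not give the form of the FDF function; the names `tf_weights`, `df_fn`, `dcg` and `ndcg`, the toy corpus, and the linear-gain DCG variant are all assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter


def idf(df, n_docs):
    """Classic inverse document frequency log(N / df); decreases as df grows."""
    return math.log(n_docs / df)


def tf_weights(docs, df_fn=idf):
    """Weight each term in each document as tf * df_fn(df, N).

    docs is a list of token lists. df_fn is any function of
    (document frequency, number of documents); passing idf gives plain TF-IDF.
    A scheme such as FDF would be plugged in here (its form is not shown).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain the term.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * df_fn(df[term], n_docs) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def dcg(relevances):
    """Discounted cumulative gain (linear gain) of relevances in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical toy corpus of tokenised enterprise documents.
docs = [
    ["invoice", "payment", "invoice"],
    ["payment", "report"],
    ["report", "payment", "audit"],
]
weights = tf_weights(docs)
```

With this toy corpus, "payment" occurs in every document, so IDF gives it weight zero regardless of how informative it is; this is the kind of behaviour a non-monotonic function of document frequency is meant to adjust. Two weighting schemes can then be compared by calling `ndcg` on the graded relevance labels of the results each one ranks.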
Pages: 433-444
Page count: 12
Related papers
50 in total
  • [1] A New Term Weighting Scheme Based on Class Specific Document Frequency for Document Representation and Classification
    Plansangket, Suthira
    Gan, John Q.
    2015 7TH COMPUTER SCIENCE AND ELECTRONIC ENGINEERING CONFERENCE (CEEC), 2015, : 5 - 8
  • [2] Improving Information Retrieval Through a Global Term Weighting Scheme
    Cuellar, Daniel
    Diaz, Elva
    Ponce-de-Leon-Senti, Eunice
    PATTERN RECOGNITION (MCPR 2015), 2015, 9116 : 246 - 257
  • [3] Embedding term similarity and inverse document frequency into a logical model of information retrieval
    Losada, DE
    Barreiro, A
    JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2003, 54 (04): : 285 - 301
  • [4] A New Term Frequency Normalization Model for Probabilistic Information Retrieval
    Jian, Fanghong
    Huang, Jimmy Xiangji
    Zhao, Jiashu
    He, Tingting
    ACM/SIGIR PROCEEDINGS 2018, 2018, : 1237 - 1240
  • [5] Term frequency with average term occurrences for textual information retrieval
    O. Ali Sadek Ibrahim
    D. Landa-Silva
    Soft Computing, 2016, 20 : 3045 - 3061
  • [6] Term frequency with average term occurrences for textual information retrieval
    Ibrahim, O. Ali Sadek
    Landa-Silva, D.
    SOFT COMPUTING, 2016, 20 (08) : 3045 - 3061
  • [7] A Part-Of-Speech term weighting scheme for biomedical information retrieval
    Wang, Yanshan
    Wu, Stephen
    Li, Dingcheng
    Mehrabi, Saeed
    Liu, Hongfang
    JOURNAL OF BIOMEDICAL INFORMATICS, 2016, 63 : 379 - 389
  • [8] A new document representation using term frequency and vectorized graph connectionists with application to document retrieval
    Chow, Tommy W. S.
    Zhang, Haijun
    Rahman, M. K. M.
    EXPERT SYSTEMS WITH APPLICATIONS, 2009, 36 (10) : 12023 - 12035
  • [9] Customized term weighting scheme for document classification
    Benjamin, C. M. X.
    Woon, W. L.
    Wong, K. S. D.
    2008 INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION ENGINEERING, VOLS 1-3, 2008, : 294 - 299
  • [10] A selective approach to index term weighting for robust information retrieval based on the frequency distributions of query terms
    Ahmet Arslan
    Bekir Taner Dinçer
    Information Retrieval Journal, 2019, 22 : 543 - 569