Term frequency - function of document frequency: a new term weighting scheme for enterprise information retrieval

Cited by: 14
Authors
Zhang, Hui [1 ]
Wang, Deqing [1 ]
Wu, Wenjun [1 ]
Hu, Hongping [1 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
enterprise information retrieval; term weighting scheme; term frequency; function of document frequency; relevance ranking; PERFORMANCE;
DOI
10.1080/17517575.2012.665945
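Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract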
In today's business environment, enterprises are under increasing pressure to process the vast amount of data they produce every day. One response is to focus on business intelligence (BI) applications and increase commercial added value through business analytics activities. Term weighting, which converts documents into vectors in the term space, is a vital task in enterprise Information Retrieval (IR), text categorisation, text analytics, etc. When determining the weight of a term in a document, the traditional TF-IDF scheme considers only the term's occurrence frequency within the document and across the entire document set, so some meaningful terms do not receive an appropriate weight. In this article, we propose a new term weighting scheme called Term Frequency - Function of Document Frequency (TF-FDF) to address this issue. Instead of using a monotonically decreasing function such as Inverse Document Frequency, FDF uses a convex function that dynamically adjusts weights according to the significance of the words in a document set. This function can be manually tuned based on the distribution of the most meaningful words that semantically represent the document set. Our experiments show that TF-FDF achieves a higher Normalised Discounted Cumulative Gain in IR than TF-IDF and its variants, and improves the accuracy of relevance ranking of the IR results.
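The idea in the abstract can be sketched roughly as follows, assuming a simple bag-of-words setting. The abstract does not give the FDF formula, so `bump_fdf` and its `peak` and `width` parameters are hypothetical stand-ins for "a tunable, non-monotonic function of document frequency" and are not the authors' function. `idf` and `ndcg` follow the standard textbook definitions.

```python
import math
from collections import Counter


def idf(df, n_docs):
    """Classic inverse document frequency: monotonically decreasing in df."""
    return math.log(n_docs / df)


def bump_fdf(df, n_docs, peak=0.3, width=0.2):
    """HYPOTHETICAL function of document frequency (not the paper's formula).

    Gives the largest weight to terms whose document-frequency ratio is near
    `peak`; `peak` and `width` are illustrative, manually tuned knobs.
    """
    ratio = df / n_docs
    return math.exp(-((ratio - peak) ** 2) / (2 * width ** 2))


def weight_documents(docs, df_func):
    """Turn tokenised documents into {term: tf * df_func(df, N)} vectors."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: tf * df_func(df[term], n_docs) for term, tf in Counter(doc).items()}
        for doc in docs
    ]


def ndcg(relevances, k=None):
    """Normalised Discounted Cumulative Gain of a ranked list of relevance grades."""
    ranked = relevances if k is None else relevances[:k]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked))
    ideal = sorted(relevances, reverse=True)[: len(ranked)]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


docs = [["report", "sales"], ["report", "budget"], ["report", "sales", "sales"]]
tfidf_vectors = weight_documents(docs, idf)
tffdf_vectors = weight_documents(docs, bump_fdf)
```

Because `weight_documents` takes the document-frequency function as an argument, swapping TF-IDF for a TF-FDF-style scheme only changes `df_func`. A term such as "report" that appears in every document gets weight zero under `idf`, while a tuned function can still assign it a small non-zero weight.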
Pages: 433-444
Number of pages: 12