Data cleansing for web information retrieval using query independent features

Cited by: 4
Authors
Liu, Yiqun [1]
Zhang, Min
Cen, Rongwei
Ru, Liyun
Ma, Shaoping
Affiliations
[1] Tsinghua Univ, State Key Lab Intelligent Technol & Syst, Beijing, Peoples R China
[2] Sohu Corp, R&D Ctr, Beijing, Peoples R China
DOI
10.1002/asi.20633
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Understanding what kinds of Web pages are the most useful for Web search engine users is a critical task in Web information retrieval (IR). Most previous works used hyperlink analysis algorithms to solve this problem. However, little research has been focused on query-independent Web data cleansing for Web IR. In this paper, we first provide analysis of the differences between retrieval target pages and ordinary ones based on more than 30 million Web pages obtained from both the Text Retrieval Conference (TREC) and a widely used Chinese search engine, SOGOU (www.sogou.com). We further propose a learning-based data cleansing algorithm for reducing Web pages that are unlikely to be useful for user requests. We found that there exists a large proportion of low-quality Web pages in both the English and the Chinese Web page corpus, and retrieval target pages can be identified using query-independent features and cleansing algorithms. The experimental results showed that our algorithm is effective in reducing a large portion of Web pages with a small loss in retrieval target pages. It makes it possible for Web IR tools to meet a large fraction of users' needs with only a small part of pages on the Web. These results may help Web search engines make better use of their limited storage and computation resources to improve search performance.
Pages: 1884-1898
Number of pages: 15