A set of novel HTML document quality features for Web information retrieval: Including applications to learning to rank for information retrieval

Authors
Aydin, Ahmet [1]
Arslan, Ahmet [1]
Dincer, Bekir Taner [2]
Affiliations
[1] Eskisehir Tech Univ, Dept Comp Engn, TR-26555 Eskisehir, Turkiye
[2] Mugla Sıtkı Kocman Univ, Dept Comp Engn, TR-48000 Mugla, Turkiye
Keywords
Information retrieval; Web search; Learning to rank; Machine learning; Search engines;
DOI
10.1016/j.eswa.2024.123177
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Past work on Information Retrieval (IR) targeting web document collections shows that incorporating a measure of web document quality, that is, a document prior (e.g., PageRank), into an IR system improves retrieval effectiveness. In this study, we introduce new document priors and empirically investigate their effect by employing them as features in a learning to rank (LTR) deployment. The experiments are performed on two standard Web IR test collections: the ClueWeb09 and ClueWeb12 datasets, which include 500 and 733 million web documents, respectively, together with the associated TREC and NTCIR query sets, which total 1,204 queries. A strong baseline is formed from standard features introduced in previous work, and the effect of the features newly introduced in this paper is measured against it. We test our features with LambdaMART, a state-of-the-art LTR technique. The results reveal that the features introduced in this work lead to improvements in retrieval performance on the test collections in use. The introduced features are classified into 5 groups with respect to their functional properties, and each group is also analyzed in detail.
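As a rough illustration of what "employing a document prior as an LTR feature" can mean in practice, the sketch below computes one query-independent score from raw HTML and appends it to an existing feature vector. The specific feature (ratio of visible text to total markup length) and the names `text_to_markup_ratio` and `add_prior` are assumptions made for this example only; the abstract does not list the paper's actual features, and this is not the authors' implementation.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def text_to_markup_ratio(html: str) -> float:
    """Hypothetical document prior: visible-text length / total HTML length."""
    if not html:
        return 0.0
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    visible = "".join(parser.chunks).strip()
    return len(visible) / len(html)


def add_prior(baseline_features, html):
    """Append the query-independent prior to a baseline LTR feature vector."""
    return list(baseline_features) + [text_to_markup_ratio(html)]
```

Because the score depends only on the document, it can be computed once at indexing time and reused for every query; a ranker such as LambdaMART would then see it as one more column alongside the query-dependent baseline features.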
Pages: 13