Webformer: Pre-training with Web Pages for Information Retrieval

Cited by: 15
Authors
Guo, Yu [1 ,4 ]
Ma, Zhengyi [1 ]
Mao, Jiaxin [1 ]
Qian, Hongjin [1 ]
Zhang, Xinyu [2 ]
Jiang, Hao [2 ]
Cao, Zhao [2 ]
Dou, Zhicheng [1 ,3 ]
Affiliations
[1] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
[2] Huawei, Distributed & Parallel Software Lab, Beijing, Peoples R China
[3] Beijing Key Lab Big Data Management & Anal Method, Beijing, Peoples R China
[4] Huawei, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Ad-hoc Retrieval; Pre-training; Web Page; DOM Tree; HTML
DOI
10.1145/3477495.3532086
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Pre-trained language models (PLMs) have achieved great success in the area of Information Retrieval. Studies show that applying these models to ad-hoc document ranking can achieve better retrieval effectiveness. However, on the Web, most information is organized in the form of HTML web pages. In addition to the pure text content, the structure of the content organized by HTML tags is also an important part of the information delivered on a web page. Currently, such structured information is totally ignored by pre-trained models, which are trained solely on text content. In this paper, we propose to leverage large-scale web pages and their DOM (Document Object Model) tree structures to pre-train models for information retrieval. We argue that by using the hierarchical structure contained in web pages, we can get richer contextual information for training better language models. To exploit this kind of information, we devise four pre-training objectives based on the structure of web pages, then pre-train a Transformer model on these tasks jointly with the traditional masked language model objective. Experimental results on two authoritative ad-hoc retrieval datasets show that our model can significantly improve ranking performance compared to existing pre-trained models.
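As a rough illustration of the structural signal the abstract refers to, the sketch below pairs each text span in an HTML page with the chain of DOM tags that encloses it. It uses only Python's standard `html.parser` module. The names `TagPathExtractor` and `extract_tag_paths` and the sample page are assumptions made for this example; the sketch does not reproduce the paper's four pre-training objectives, which the abstract does not specify.

```python
from html.parser import HTMLParser


class TagPathExtractor(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node.

    Hypothetical helper for illustration only; not taken from the paper.
    """

    # Elements that never have a closing tag, so they are not pushed.
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def extract_tag_paths(html):
    """Return a list of (tag_path, text) pairs in document order."""
    parser = TagPathExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


page = (
    "<html><body><h1>Webformer</h1>"
    "<div><p>Pre-training with web pages.</p></div></body></html>"
)
for path, text in extract_tag_paths(page):
    print(path, "->", text)
```

A text-only model would see just the two strings; the tag paths (`html/body/h1` versus `html/body/div/p`) are the kind of hierarchical context that is lost when a page is flattened to plain text.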
Pages: 1502-1512
Page count: 11