H-ERNIE: A Multi-Granularity Pre-Trained Language Model for Web Search

Cited by: 7
Authors
Chu, Xiaokai [1 ,3 ]
Zhao, Jiashu [2 ]
Zou, Lixin [3 ]
Yin, Dawei [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Wilfrid Laurier Univ, Waterloo, ON, Canada
[3] Baidu Inc, Beijing, Peoples R China
Keywords
Information Retrieval; Web Search; Pre-trained Language Models;
DOI
10.1145/3477495.3531986
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Pre-trained language models (PLMs), such as BERT and ERNIE, have achieved outstanding performance in many natural language understanding tasks. Recently, PLM-based Information Retrieval models have also been investigated and have shown state-of-the-art effectiveness, e.g., MORES, PROP and ColBERT. However, most PLM-based rankers focus only on single-level relevance matching (e.g., character-level), while ignoring information at other granularities (e.g., words and phrases), which easily leads to ambiguity in query understanding and inaccurate matching in web search. In this paper, we aim to improve the state-of-the-art PLM ERNIE for web search by modeling multi-granularity context information with awareness of word importance in queries and documents. In particular, we propose a novel H-ERNIE framework, which includes a query-document analysis component and a hierarchical ranking component. The query-document analysis component has several individual modules that generate the necessary variables, such as word segmentation, word importance analysis, and word tightness analysis. Based on these variables, the importance-aware multi-level correspondences are sent to the ranking model. The hierarchical ranking model includes a multi-layer transformer module to learn character-level representations, a word-level matching module, and a phrase-level matching module with word importance. Each of these modules models query-document matching from a different perspective, and the levels communicate with each other to achieve accurate overall matching. We discuss the time complexity of the proposed framework and show that it can be efficiently implemented in real applications. Offline and online experiments on both public datasets and a commercial search engine illustrate the effectiveness of the proposed H-ERNIE framework.
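To make the multi-granularity matching idea in the abstract concrete, the sketch below shows one way a hierarchical ranker could combine character-, word-, and phrase-level signals with query-word importance weights. It is a minimal illustration under assumed module names, dimensions, and aggregation choices (e.g., HierarchicalRanker, mean-pooled spans, a small transformer standing in for ERNIE); it is not the authors' H-ERNIE implementation.

```python
# Minimal PyTorch sketch of multi-granularity query-document matching.
# All names, shapes, and the score-combination scheme are illustrative assumptions.
import torch
import torch.nn as nn


class HierarchicalRanker(nn.Module):
    def __init__(self, vocab_size=30000, dim=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Character-level contextual encoder (a small transformer stands in for ERNIE).
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.char_encoder = nn.TransformerEncoder(layer, n_layers)
        # Word- and phrase-level matching heads over pooled spans.
        self.word_match = nn.Bilinear(dim, dim, 1)
        self.phrase_match = nn.Bilinear(dim, dim, 1)
        self.score = nn.Linear(3, 1)  # combine character/word/phrase signals

    @staticmethod
    def pool_spans(char_repr, spans, weights=None):
        """Mean-pool character representations into span (word/phrase) vectors,
        optionally scaled by a per-span importance weight."""
        vecs = []
        for i, (s, e) in enumerate(spans):
            v = char_repr[:, s:e].mean(dim=1)
            if weights is not None:
                v = v * weights[i]
            vecs.append(v)
        return torch.stack(vecs, dim=1)  # [batch, n_spans, dim]

    def forward(self, q_ids, d_ids, q_word_spans, d_word_spans,
                q_phrase_spans, d_phrase_spans, q_word_importance):
        q_char = self.char_encoder(self.embed(q_ids))
        d_char = self.char_encoder(self.embed(d_ids))
        # 1) Character-level signal: similarity of the first-token representations.
        char_sig = (q_char[:, 0] * d_char[:, 0]).sum(-1, keepdim=True)
        # 2) Word-level signal, weighted by query-word importance.
        q_words = self.pool_spans(q_char, q_word_spans, q_word_importance)
        d_words = self.pool_spans(d_char, d_word_spans)
        word_sig = self.word_match(q_words.mean(1), d_words.mean(1))
        # 3) Phrase-level signal (phrase spans would come from word-tightness analysis).
        q_phr = self.pool_spans(q_char, q_phrase_spans)
        d_phr = self.pool_spans(d_char, d_phrase_spans)
        phrase_sig = self.phrase_match(q_phr.mean(1), d_phr.mean(1))
        # Combine the three granularities into one relevance score.
        return self.score(torch.cat([char_sig, word_sig, phrase_sig], dim=-1))


# Toy usage: a 6-token query vs. a 10-token document with hand-specified spans.
model = HierarchicalRanker()
q_ids = torch.randint(0, 30000, (1, 6))
d_ids = torch.randint(0, 30000, (1, 10))
score = model(
    q_ids, d_ids,
    q_word_spans=[(0, 2), (2, 4), (4, 6)], d_word_spans=[(0, 3), (3, 7), (7, 10)],
    q_phrase_spans=[(0, 4)], d_phrase_spans=[(0, 7)],
    q_word_importance=torch.tensor([1.0, 0.5, 0.2]),
)
print(score.shape)  # torch.Size([1, 1])
```

In this sketch the word and phrase spans are given as inputs, mirroring the paper's separation between the query-document analysis component (segmentation, importance, tightness) and the hierarchical ranking component that consumes its outputs.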
Pages: 1478-1489
Page count: 12