The Power of Selecting Key Blocks with Local Pre-ranking for Long Document Information Retrieval

Cited by: 9
Authors
Li, Minghan [1 ]
Popa, Diana Nicoleta [2 ]
Chagnon, Johan [3 ]
Cinar, Yagmur Gizem [4 ]
Gaussier, Eric [1 ]
Affiliations
[1] Univ Grenoble Alpes, Grenoble, France
[2] Telepathy Labs, Zurich, Switzerland
[3] Univ Wollongong, Wollongong, NSW, Australia
[4] Amazon, London, England
Keywords
BERT-based language models; long-document neural information retrieval
DOI
10.1145/3568394
CLC Number
TP [Automation technology, computer technology]
Subject Classification Code
0812
Abstract
On a wide range of natural language processing and information retrieval tasks, transformer-based models, particularly pre-trained language models like BERT, have demonstrated tremendous effectiveness. Due to the quadratic complexity of the self-attention mechanism, however, such models have difficulty processing long documents. Recent works addressing this issue either truncate long documents, in which case potentially relevant information is lost; segment them into several passages, which may also miss information and incurs high computational cost when the number of passages is large; or modify the self-attention mechanism to make it sparser, as in sparse-attention models, again at the risk of missing information. We follow here a slightly different approach in which one first selects key blocks of a long document by local query-block pre-ranking, and then a few blocks are aggregated to form a short document that can be processed by a model such as BERT. Experiments conducted on standard information retrieval datasets demonstrate the effectiveness of the proposed approach.
Pages: 35
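For illustration, below is a minimal sketch of the idea described in the abstract: split a long document into fixed-size blocks, pre-rank the blocks locally against the query, and aggregate the top-ranked blocks into a short pseudo-document that fits within a BERT-style model's input limit. The block size, number of selected blocks, and the simple term-frequency score used here are illustrative assumptions, not the paper's exact components.

```python
# Illustrative sketch of key-block selection with local pre-ranking.
# The scoring function and hyperparameters are assumptions for illustration,
# not the exact components used in the paper.
from collections import Counter
from typing import List


def split_into_blocks(tokens: List[str], block_size: int = 64) -> List[List[str]]:
    """Split a tokenized document into contiguous, fixed-size blocks."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def lexical_score(query_tokens: List[str], block: List[str]) -> float:
    """Cheap query-block score: frequency of query terms in the block
    (a simple stand-in for the paper's local pre-ranking score)."""
    counts = Counter(t.lower() for t in block)
    return float(sum(counts[q.lower()] for q in query_tokens))


def select_key_blocks(query: str, document: str,
                      block_size: int = 64, top_k: int = 4) -> str:
    """Return a short pseudo-document made of the top-k pre-ranked blocks,
    kept in their original document order."""
    q_tokens = query.split()
    d_tokens = document.split()
    blocks = split_into_blocks(d_tokens, block_size)
    # Rank block indices by the local query-block score and keep the k best.
    ranked = sorted(range(len(blocks)),
                    key=lambda i: lexical_score(q_tokens, blocks[i]),
                    reverse=True)[:top_k]
    # Restore document order among the selected blocks to preserve coherence.
    selected = sorted(ranked)
    short_doc = " ".join(" ".join(blocks[i]) for i in selected)
    # The (query, short_doc) pair can now be scored by a standard BERT
    # cross-encoder without exceeding its input length limit.
    return short_doc


if __name__ == "__main__":
    query = "quadratic complexity of self-attention"
    document = " ".join(["filler text about unrelated topics"] * 200
                        + ["self-attention has quadratic complexity in sequence length"]
                        + ["more filler text"] * 200)
    print(select_key_blocks(query, document)[:200])
```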