The Power of Selecting Key Blocks with Local Pre-ranking for Long Document Information Retrieval

Cited by: 9
Authors
Li, Minghan [1 ]
Popa, Diana Nicoleta [2 ]
Chagnon, Johan [3 ]
Cinar, Yagmur Gizem [4 ]
Gaussier, Eric [1 ]
Affiliations
[1] Univ Grenoble Alpes, Grenoble, France
[2] Telepathy Labs, Zurich, Switzerland
[3] Univ Wollongong, Wollongong, NSW, Australia
[4] Amazon, London, England
Keywords
BERT-based language models; long-document neural information retrieval
DOI
10.1145/3568394
CLC Number
TP [Automation technology, computer technology]
Subject Classification Code
0812
Abstract
On a wide range of natural language processing and information retrieval tasks, transformer-based models, particularly pre-trained language models like BERT, have demonstrated tremendous effectiveness. Due to the quadratic complexity of the self-attention mechanism, however, such models have difficulty processing long documents. Recent works addressing this issue either truncate long documents, in which case potentially relevant information is lost; segment them into several passages, which may also miss information and incurs high computational cost when the number of passages is large; or modify the self-attention mechanism to make it sparser, as in sparse-attention models, again at the risk of missing information. We follow here a slightly different approach in which one first selects key blocks of a long document by local query-block pre-ranking, and then a few blocks are aggregated to form a short document that can be processed by a model such as BERT. Experiments conducted on standard information retrieval datasets demonstrate the effectiveness of the proposed approach.
Pages: 35
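For illustration, below is a minimal sketch of the idea described in the abstract: split a long document into fixed-size blocks, pre-rank the blocks locally against the query, and aggregate the top-ranked blocks into a short pseudo-document that fits within a BERT-style model's input limit. The block size, number of selected blocks, and the simple term-frequency score used here are illustrative assumptions, not the paper's exact components.

```python
# Illustrative sketch of key-block selection with local pre-ranking.
# The scoring function and hyperparameters are assumptions for illustration,
# not the exact components used in the paper.
from collections import Counter
from typing import List


def split_into_blocks(tokens: List[str], block_size: int = 64) -> List[List[str]]:
    """Split a tokenized document into contiguous, fixed-size blocks."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def lexical_score(query_tokens: List[str], block: List[str]) -> float:
    """Cheap query-block score: frequency of query terms in the block
    (a simple stand-in for the paper's local pre-ranking score)."""
    counts = Counter(t.lower() for t in block)
    return float(sum(counts[q.lower()] for q in query_tokens))


def select_key_blocks(query: str, document: str,
                      block_size: int = 64, top_k: int = 4) -> str:
    """Return a short pseudo-document made of the top-k pre-ranked blocks,
    kept in their original document order."""
    q_tokens = query.split()
    d_tokens = document.split()
    blocks = split_into_blocks(d_tokens, block_size)
    # Rank block indices by the local query-block score and keep the k best.
    ranked = sorted(range(len(blocks)),
                    key=lambda i: lexical_score(q_tokens, blocks[i]),
                    reverse=True)[:top_k]
    # Restore document order among the selected blocks to preserve coherence.
    selected = sorted(ranked)
    short_doc = " ".join(" ".join(blocks[i]) for i in selected)
    # The (query, short_doc) pair can now be scored by a standard BERT
    # cross-encoder without exceeding its input length limit.
    return short_doc


if __name__ == "__main__":
    query = "quadratic complexity of self-attention"
    document = " ".join(["filler text about unrelated topics"] * 200
                        + ["self-attention has quadratic complexity in sequence length"]
                        + ["more filler text"] * 200)
    print(select_key_blocks(query, document)[:200])
```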