Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations

Cited by: 137
Authors
Lin, Jimmy [1 ]
Ma, Xueguang [1 ]
Lin, Sheng-Chieh [1 ]
Yang, Jheng-Hong [1 ]
Pradeep, Ronak [1 ]
Nogueira, Rodrigo [1 ]
Affiliations
[1] Univ Waterloo, David R Cheriton Sch Comp Sci, Waterloo, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Open-Source Search Engine; First-Stage Retrieval;
DOI
10.1145/3404835.3463238
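Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract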
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. Around this toolkit, our group has built a culture of reproducibility through shared norms and tools that enable rigorous automated testing.
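The three retrieval modes named in the abstract can be illustrated with a minimal, self-contained sketch. It does not use Pyserini itself and is not the toolkit's implementation: the corpus, the two-dimensional document vectors, the parameter values `k1`, `b`, and `alpha`, and the weighted-sum fusion are all assumptions chosen for illustration. In a real system the vectors would come from a transformer encoder and the search would run over an index rather than a Python dictionary.

```python
import math
from collections import Counter

# Toy corpus and hand-written 2-d "embeddings"; all values are made up.
DOCS = {
    "d1": "sparse retrieval with bm25 scoring",
    "d2": "dense retrieval with transformer encoders",
    "d3": "cooking pasta at home",
}
DOC_VECS = {"d1": [0.9, 0.1], "d2": [0.7, 0.6], "d3": [0.0, 0.1]}


def tokenize(text):
    """Lowercase whitespace tokenization (a stand-in for a real analyzer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Sparse retrieval: score each document with a common BM25 variant."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def dense_scores(query_vec, doc_vecs):
    """Dense retrieval: brute-force inner product against every document vector."""
    return {
        doc_id: sum(q * d for q, d in zip(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    }


def hybrid_scores(sparse, dense, alpha=0.5):
    """Hybrid retrieval: dense score plus alpha times sparse score (one simple fusion)."""
    return {doc_id: dense[doc_id] + alpha * sparse.get(doc_id, 0.0) for doc_id in dense}


sparse = bm25_scores("bm25 retrieval", DOCS)
dense = dense_scores([1.0, 0.5], DOC_VECS)  # hypothetical query embedding
hybrid = hybrid_scores(sparse, dense)

for name, scores in [("sparse", sparse), ("dense", dense), ("hybrid", hybrid)]:
    print(name, sorted(scores, key=scores.get, reverse=True))
```

On this toy data the sparse scorer ranks d1 first because it contains the exact term "bm25", the dense scorer ranks d2 first, and the fused score puts d1 back on top, which shows how a hybrid can combine the two signals.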
Pages: 2356-2362
Page count: 7
Related papers
12 in total
  • [1] Python Tools for Reproducible Research on Hyperbolic Problems
    LeVeque, Randall J.
    [J]. COMPUTING IN SCIENCE & ENGINEERING, 2009, 11 (01) : 19 - 27
  • [2] Parrot: A Python-based Interactive Platform for Information Retrieval Research
    Tu, Xinhui
    Huang, Jimmy
    Luo, Jing
    Zhu, Runjie
    He, Tingting
    [J]. PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19), 2019, : 1289 - 1292
  • [3] A Python Instrument Control and Data Acquisition Suite for Reproducible Research
    Koerner, Lucas J.
    Caswell, Thomas A.
    Allan, Daniel B.
    Campbell, Stuart I.
    [J]. IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2020, 69 (04) : 1698 - 1707
  • [4] WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset
    Frej, Jibril
    Schwab, Didier
    Chevallet, Jean-Pierre
    [J]. PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 1926 - 1933
  • [5] Multimedia Information Retrieval in Big Data using OpenCV Python
    Goularte, Rudinei
    Trojahn, Tiago H.
    Kishi, Rodrigo M.
    [J]. WEBMEDIA 2019: PROCEEDINGS OF THE 25TH BRAZILLIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB, 2019, : 25 - 27
  • [6] PyTerrier: Declarative Experimentation in Python from BM25 to Dense Retrieval
    Macdonald, Craig
    Tonellotto, Nicola
    MacAveney, Sean
    Ounis, Iadh
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 4526 - 4533
  • [7] Patapasco: A Python Framework for Cross-Language Information Retrieval Experiments
    Costello, Cash
    Yang, Eugene
    Lawrie, Dawn
    Mayfield, James
    [J]. ADVANCES IN INFORMATION RETRIEVAL, PT II, 2022, 13186 : 276 - 280
  • [8] A Framework of Petroleum Information Retrieval System Based On Web Scraping With Python
    Ren, Yili
    Ren, Yiting
    [J]. 2018 15TH INTERNATIONAL CONFERENCE ON SERVICE SYSTEMS AND SERVICE MANAGEMENT (ICSSSM), 2018,
  • [9] Data management routines for reproducible research using the G-Node Python Client library
    Sobolev, Andrey
    Stoewer, Adrian
    Pereira, Michael
    Kellner, Christian J.
    Garbers, Christian
    Rautenberg, Philipp L.
    Wachtler, Thomas
    [J]. FRONTIERS IN NEUROINFORMATICS, 2014, 8
  • [10] Sparse, Dense, and Attentional Representations for Text Retrieval
    Luan, Yi
    Eisenstein, Jacob
    Toutanova, Kristina
    Collins, Michael
    [J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2021, 9 : 329 - 345