WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset

Cited by: 0
Authors
Frej, Jibril [1 ]
Schwab, Didier [1 ]
Chevallet, Jean-Pierre [1 ]
Affiliations
[1] Univ Grenoble Alpes, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France
Keywords
Information Retrieval; Open Source; Dataset; Deep Learning
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Over the past few years, deep learning methods have produced new state-of-the-art results in ad-hoc information retrieval. However, such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 annotated queries, recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g. DUET, Conv-KNRM) are trained and evaluated on data collected from commercial search engines that is not publicly available for academic research, which is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR: an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR78k and wikIRS78k: two large-scale publicly available datasets that both contain 78,628 queries and 3,060,191 (query, relevant document) pairs.
Pages: 1926-1933
Page count: 8