Pira: A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean

Cited by: 6
Authors
Paschoal, Andre F. A. [1 ]
Pirozelli, Paulo [2 ]
Freire, Valdinei [1 ]
Delgado, Karina V. [1 ]
Peres, Sarajane M. [1 ]
Jose, Marcos M. [3 ]
Nakasato, Flavio [3 ]
Oliveira, Andre S. [3 ]
Brandao, Anarosa A. F. [3 ]
Costa, Anna H. R. [3 ]
Cozman, Fabio G. [3 ]
Affiliations
[1] Univ Sao Paulo, Escola Artes Ciencias & Humanidades, Sao Paulo, Brazil
[2] Univ Sao Paulo, Inst Estudos Avancados, Sao Paulo, Brazil
[3] Univ Sao Paulo, Escola Politecn, Sao Paulo, Brazil
Funding
Sao Paulo Research Foundation, Brazil
Keywords
Question-answering dataset; Bilingual dataset; Portuguese-English dataset; Ocean dataset
DOI
10.1145/3459637.3482012
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Current research in natural language processing is highly dependent on carefully produced corpora. Most existing resources focus on English; some resources focus on languages such as Chinese and French; few resources deal with more than one language. This paper presents the Pira dataset, a large set of questions and answers about the ocean and the Brazilian coast both in Portuguese and English. Pira is, to the best of our knowledge, the first QA dataset with supporting texts in Portuguese, and, perhaps more importantly, the first bilingual QA dataset that includes this language. The Pira dataset consists of 2261 properly curated question/answer (QA) sets in both languages. The QA sets were manually created based on two corpora: abstracts related to the Brazilian coast and excerpts of United Nations reports about the ocean. The QA sets were validated in a peer-review process with the dataset contributors. We discuss some of the advantages as well as limitations of Pira, as this new resource can support a set of tasks in NLP such as question-answering, information retrieval, and machine translation.
Pages: 4544-4553
Page count: 10