MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization

Authors
Xu, Canwen [1]
Pei, Jiaxin [2]
Wu, Hongtao [3]
Liu, Yiyu [3]
Li, Chenliang [3]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Wuhan, Hubei, Peoples R China
[2] Univ Michigan, Sch Informat, Ann Arbor, MI 48109 USA
[3] Wuhan Univ, Sch Cyber Sci & Engn, Wuhan, Hubei, Peoples R China
Funding
National Natural Science Foundation of China
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Recently, large-scale datasets have vastly facilitated the development in nearly all domains of Natural Language Processing. However, there is currently no cross-task dataset in NLP, which hinders the development of multi-task learning. We propose MATINF, the first jointly labeled large-scale dataset for classification, question answering and summarization. MATINF contains 1.07 million question-answer pairs with human-labeled categories and user-generated question descriptions. Based on such rich information, MATINF is applicable for three major NLP tasks, including classification, question answering, and summarization. We benchmark existing methods and a novel multi-task baseline over MATINF to inspire further research. Our comprehensive comparison and experiments over MATINF and other datasets demonstrate the merits held by MATINF.
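As an illustration of how one jointly labeled record could serve all three tasks, here is a minimal Python sketch. The `Record` class, its field names, and the `task_views` helper are hypothetical, not part of any released MATINF tooling. The input/target pairing for each task is an assumption made for illustration; the abstract states only that each question-answer pair carries a human-labeled category and a user-generated question description.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One jointly labeled entry (hypothetical field names)."""
    question: str
    description: str
    answer: str
    category: str


def task_views(record: Record) -> dict[str, tuple[str, str]]:
    """Return an (input, target) pair per task, derived from a single record.

    The pairings are assumptions for illustration only:
      - classification: question plus description -> category label
      - question answering: question -> answer
      - summarization: description -> question (treated as a short summary)
    """
    return {
        "classification": (f"{record.question} {record.description}", record.category),
        "question_answering": (record.question, record.answer),
        "summarization": (record.description, record.question),
    }


# Invented sample data, used only to show the shape of the output.
sample = Record(
    question="Example question?",
    description="A longer user-written description of the question.",
    answer="An example answer.",
    category="example-category",
)
views = task_views(sample)
```

Because every view is derived from the same record, a multi-task model could share an encoder across the three tasks while reading each training example only once.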
Pages: 3586-3596
Number of pages: 11