MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization

Cited by: 0

Authors:

Xu, Canwen [1]

Pei, Jiaxin [2]

Wu, Hongtao [3]

Liu, Yiyu [3]

Li, Chenliang [3]

Affiliations:

[1] Wuhan Univ, Sch Comp Sci, Wuhan, Hubei, Peoples R China

[2] Univ Michigan, Sch Informat, Ann Arbor, MI 48109 USA

[3] Wuhan Univ, Sch Cyber Sci & Engn, Wuhan, Hubei, Peoples R China

Source:

58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020) | 2020

Funding:

National Natural Science Foundation of China

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recently, large-scale datasets have vastly facilitated the development in nearly all domains of Natural Language Processing. However, there is currently no cross-task dataset in NLP, which hinders the development of multi-task learning. We propose MATINF, the first jointly labeled large-scale dataset for classification, question answering and summarization. MATINF contains 1.07 million question-answer pairs with human-labeled categories and user-generated question descriptions. Based on such rich information, MATINF is applicable for three major NLP tasks, including classification, question answering, and summarization. We benchmark existing methods and a novel multi-task baseline over MATINF to inspire further research. Our comprehensive comparison and experiments over MATINF and other datasets demonstrate the merits held by MATINF.

Pages: 3586-3596

Page count: 11
