DOCNLI: A Large-scale Dataset for Document-level Natural Language Inference

Citations: 0

Authors:

Yin, Wenpeng [1]

Radev, Dragomir [1,2]

Xiong, Caiming [1]

Affiliations:

[1] Salesforce Res, Palo Alto, CA 94301 USA

[2] Yale Univ, New Haven, CT 06520 USA

Source:

FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021 | 2021

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems such as relation extraction, question answering, summarization, etc. It has been studied intensively in the past few years thanks to the availability of large-scale labeled datasets. However, most existing studies focus on merely sentence-level inference, which limits the scope of NLI's application in downstream NLP problems. This work presents DOCNLI, a newly constructed large-scale dataset for document-level NLI. DOCNLI is transformed from a broad range of NLP problems and covers multiple genres of text. The premises always stay in the document granularity, whereas the hypotheses vary in length from single sentences to passages with hundreds of words. Additionally, DOCNLI has pretty limited artifacts, which unfortunately widely exist in some popular sentence-level NLI datasets. Our experiments demonstrate that, even without fine-tuning, a model pre-trained on DOCNLI shows promising performance on popular sentence-level benchmarks, and generalizes well to out-of-domain NLP tasks that rely on inference at document granularity. Task-specific fine-tuning can bring further improvements. Data, code and pretrained models can be found at https://github.com/salesforce/DocNLI.

Pages: 4913-4922

Page count: 10

