DOCNLI: A Large-scale Dataset for Document-level Natural Language Inference

Cited by: 0
Authors
Yin, Wenpeng [1]
Radev, Dragomir [1,2]
Xiong, Caiming [1]
Affiliations
[1] Salesforce Res, Palo Alto, CA 94301 USA
[2] Yale Univ, New Haven, CT 06520 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems such as relation extraction, question answering, summarization, etc. It has been studied intensively in the past few years thanks to the availability of large-scale labeled datasets. However, most existing studies focus on merely sentence-level inference, which limits the scope of NLI's application in downstream NLP problems. This work presents DOCNLI - a newly-constructed large-scale dataset for document-level NLI. DOCNLI is transformed from a broad range of NLP problems and covers multiple genres of text. The premises always stay in the document granularity, whereas the hypotheses vary in length from single sentences to passages with hundreds of words. Additionally, DOCNLI has pretty limited artifacts which unfortunately widely exist in some popular sentence-level NLI datasets. Our experiments demonstrate that, even without fine-tuning, a model pre-trained on DOCNLI shows promising performance on popular sentence-level benchmarks, and generalizes well to out-of-domain NLP tasks that rely on inference at document granularity. Task-specific fine-tuning can bring further improvements. Data, code and pretrained models can be found at https://github.com/salesforce/DocNLI.
Pages: 4913 - 4922
Number of pages: 10
Related papers
50 in total
  • [1] ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
    Koreeda, Yuta
    Manning, Christopher D.
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 1907 - 1919
  • [2] DocRED: A Large-Scale Document-Level Relation Extraction Dataset
    Yao, Yuan
    Ye, Deming
    Li, Peng
    Han, Xu
    Lin, Yankai
    Liu, Zhenghao
    Liu, Zhiyuan
    Huang, Lixin
    Zhou, Jie
    Sun, Maosong
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 764 - 777
  • [3] DuEE-Fin: A Large-Scale Dataset for Document-Level Event Extraction
    Han, Cuiyun
    Zhang, Jinchuan
    Li, Xinyu
    Xu, Guojin
    Peng, Weihua
    Zeng, Zengfeng
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2022, PT I, 2022, 13551 : 172 - 183
  • [4] A large-scale dataset for Korean document-level relation extraction from encyclopedia texts
    Son, Suhyune
    Lim, Jungwoo
    Koo, Seonmin
    Kim, Jinsung
    Kim, Younghoon
    Lim, Youngsik
    Hyun, Dongseok
    Lim, Heuiseok
    APPLIED INTELLIGENCE, 2024, 54 (17-18) : 8681 - 8701
  • [5] Document-Level Machine Translation with Large Language Models
    Wang, Longyue
    Lyu, Chenyang
    Ji, Tianbo
    Zhang, Zhirui
    Yu, Dian
    Shi, Shuming
    Tu, Zhaopeng
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 16646 - 16661
  • [6] MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
    Kudugunta, Sneha
    Caswell, Isaac
    Zhang, Biao
    Garcia, Xavier
    Xin, Derrick
    Kusupati, Aditya
    Stella, Romi
    Bapna, Ankur
    Firat, Orhan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [7] DocEE: A Large-Scale and Fine-grained Benchmark for Document-level Event Extraction
    Tong, Meihan
    Xu, Bin
    Wang, Shuai
    Han, Meihuan
    Cao, Yixin
    Zhu, Jiangqi
    Chen, Siyu
    Hou, Lei
    Li, Juanzi
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 3970 - 3982
  • [8] AutoRE: Document-Level Relation Extraction with Large Language Models
    Xue, Lilong
    Zhang, Dan
    Dong, Yuxiao
    Tang, Jie
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 3: SYSTEM DEMONSTRATIONS, 2024, : 211 - 220
  • [9] Microsoft Translator at WMT 2019: Towards Large-Scale Document-Level Neural Machine Translation
    Junczys-Dowmunt, Marcin
    FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT 2019), 2019, : 225 - 233
  • [10] Towards a large-scale person search by Vietnamese natural language: dataset and methods
    Thi Thanh Thuy Pham
    Hong-Quan Nguyen
    Hoai Phan
    Thi-Ngoc-Diep Do
    Thuy-Binh Nguyen
    Thanh-Hai Tran
    Thi-Lan Le
    MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (19) : 27569 - 27600