BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization

Citations: 0
Authors
Sharma, Eva [1 ]
Li, Chen [2 ]
Wang, Lu [1 ]
Affiliations
[1] Northeastern Univ, Khoury Coll Comp Sci, Boston, MA 02115 USA
[2] Tencent AI Lab, Bellevue, WA USA
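Funding
U.S. National Science Foundation
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes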
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most existing text summarization datasets are compiled from the news domain, where summaries have a flattened discourse structure. In such datasets, summary-worthy content often appears in the beginning of input articles. Moreover, large segments from input articles are present verbatim in their respective summaries. These issues impede the learning and evaluation of systems that can understand an article's global content structure as well as produce abstractive summaries with high compression ratio. In this work, we present a novel dataset, BIGPATENT, consisting of 1.3 million records of U.S. patent documents along with human written abstractive summaries. Compared to existing summarization datasets, BIGPATENT has the following properties: i) summaries contain a richer discourse structure with more recurring entities, ii) salient content is evenly distributed in the input, and iii) lesser and shorter extractive fragments are present in the summaries. Finally, we train and evaluate baselines and popular learning models on BIGPATENT to shed light on new challenges and motivate future directions for summarization research.
Pages: 2204 - 2213
Number of pages: 10
Related papers
50 in total
  • [1] Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model
    Fabbri, Alexander R.
    Li, Irene
    She, Tianwei
    Li, Suyi
    Radev, Dragomir R.
    [J]. 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 1074 - 1084
  • [2] XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
    Hasan, Tahmid
    Bhattacharjee, Abhik
    Islam, Md Saiful
    Samin, Kazi
    Li, Yuan-Fang
    Kang, Yong-Bin
    Rahman, M. Sohel
    Shahriyar, Rifat
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 4693 - 4703
  • [3] MEDIASUM: A Large-scale Media Interview Dataset for Dialogue Summarization
    Zhu, Chenguang
    Liu, Yang
    Mei, Jie
    Zeng, Michael
    [J]. 2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 5927 - 5934
  • [4] SummScreen: A Dataset for Abstractive Screenplay Summarization
    Chen, Mingda
    Chu, Zewei
    Wiseman, Sam
    Gimpel, Kevin
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 8602 - 8615
  • [5] Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian
    Batuhan Baykara
    Tunga Güngör
    [J]. Language Resources and Evaluation, 2022, 56 : 973 - 1007
  • [6] Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian
    Baykara, Batuhan
    Gungor, Tunga
    [J]. LANGUAGE RESOURCES AND EVALUATION, 2022, 56 (03) : 973 - 1007
  • [7] MCLS: A Large-Scale Multimodal Cross-Lingual Summarization Dataset
    Shi, Xiaorui
    [J]. CHINESE COMPUTATIONAL LINGUISTICS, CCL 2023, 2023, 14232 : 273 - 288
  • [8] Liputan6: A Large-scale Indonesian Dataset for Text Summarization
    Koto, Fajri
    Lau, Jey Han
    Baldwin, Timothy
    [J]. 1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020, : 598 - 608
  • [9] Mr. HiSum: A Large-scale Dataset for Video Highlight Detection and Summarization
    Sul, Jinhwan
    Han, Jihoon
    Lee, Joonseok
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [10] DACSA: A large-scale Dataset for Automatic summarization of Catalan and Spanish newspaper Articles
    Segarra, Encarna
    Ahuir, Vicent
    Hurtado, Lluis-F
    Angel Gonzalez, Jose
    [J]. NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 5931 - 5943