MEDIASUM: A Large-scale Media Interview Dataset for Dialogue Summarization

Cited by: 0
Authors
Zhu, Chenguang [1]
Liu, Yang [1]
Mei, Jie [1]
Zeng, Michael [1]
Affiliations
[1] Microsoft Cognitive Services Research Group, Redmond, WA, USA
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper introduces MEDIASUM, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and use their overview and topic descriptions as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the unique positional bias exhibited in the transcripts of television and radio interviews. We also show that MEDIASUM can be used in transfer learning to improve a model's performance on other dialogue summarization tasks.
Pages: 5927-5934
Page count: 8