MEDIASUM: A Large-scale Media Interview Dataset for Dialogue Summarization

Citations: 0
Authors
Zhu, Chenguang [1]
Liu, Yang [1]
Mei, Jie [1]
Zeng, Michael [1]
Affiliations
[1] Microsoft Cognitive Services Research Group, Redmond, WA, USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper introduces MEDIASUM, a large-scale media interview dataset consisting of 463.6K transcripts with abstractive summaries. To create this dataset, we collect interview transcripts from NPR and CNN and use their overview and topic descriptions as summaries. Compared with existing public corpora for dialogue summarization, our dataset is an order of magnitude larger and contains complex multi-party conversations from multiple domains. We conduct statistical analysis to demonstrate the unique positional bias exhibited in the transcripts of television and radio interviews. We also show that MEDIASUM can be used in transfer learning to improve a model's performance on other dialogue summarization tasks.
Pages: 5927-5934
Page count: 8