SuperDialseg: A Large-scale Dataset for Supervised Dialogue Segmentation

Authors
Jiang, Junfeng [1]
Dong, Chengzhang [2]
Kurohashi, Sadao [2,3]
Aizawa, Akiko [1,3]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
[2] Kyoto Univ, Kyoto, Japan
[3] Natl Inst Informat, Tokyo, Japan
Keywords
TEXT; MODEL;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dialogue segmentation is a crucial task for dialogue systems, allowing a better understanding of conversational texts. Despite recent progress in unsupervised dialogue segmentation methods, their performance is limited by the lack of explicit supervised signals for training. Furthermore, the precise definition of segmentation points in conversations remains a challenging problem, increasing the difficulty of collecting manual annotations. In this paper, we provide a feasible definition of dialogue segmentation points with the help of document-grounded dialogues and release a large-scale supervised dataset called SuperDialseg, which contains 9,478 dialogues based on two prevalent document-grounded dialogue corpora and inherits their useful dialogue-related annotations. Moreover, we provide a benchmark including 18 models across five categories for the dialogue segmentation task, with several proper evaluation metrics. Empirical studies show that supervised learning is extremely effective on in-domain datasets and that models trained on SuperDialseg achieve good generalization ability on out-of-domain data. Additionally, we conducted human verification on the test set, and the Kappa score confirmed the quality of our automatically constructed dataset. We believe our work is an important step forward in the field of dialogue segmentation. Our code and data are available at https://github.com/Coldog2333/SuperDialseg.
Pages: 4086-4101
Number of pages: 16