SuperDialseg: A Large-scale Dataset for Supervised Dialogue Segmentation

Cited by: 0
Authors
Jiang, Junfeng [1]
Dong, Chengzhang [2]
Kurohashi, Sadao [2, 3]
Aizawa, Akiko [1, 3]
Affiliations
[1] Univ Tokyo, Tokyo, Japan
[2] Kyoto Univ, Kyoto, Japan
[3] Natl Inst Informat, Tokyo, Japan
Keywords
TEXT; MODEL;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Dialogue segmentation is a crucial task for dialogue systems, allowing a better understanding of conversational texts. Despite recent progress in unsupervised dialogue segmentation methods, their performance is limited by the lack of explicit supervised signals for training. Furthermore, the precise definition of segmentation points in conversations remains a challenging problem, increasing the difficulty of collecting manual annotations. In this paper, we provide a feasible definition of dialogue segmentation points with the help of document-grounded dialogues and release a large-scale supervised dataset called SuperDialseg, which contains 9,478 dialogues based on two prevalent document-grounded dialogue corpora and inherits their useful dialogue-related annotations. Moreover, we provide a benchmark including 18 models across five categories for the dialogue segmentation task with several proper evaluation metrics. Empirical studies show that supervised learning is extremely effective on in-domain datasets and that models trained on SuperDialseg generalize well to out-of-domain data. Additionally, we conducted human verification on the test set, and the Kappa score confirmed the quality of our automatically constructed dataset. We believe our work is an important step forward in the field of dialogue segmentation. Our code and data are available at: https://github.com/Coldog2333/SuperDialseg.
Pages: 4086 - 4101
Number of pages: 16
Related papers
50 in total
  • [1] A Large-Scale Dataset for Water Segmentation of SAR Satellite
    Kim, Myeung Un
    Oh, Han
    Lee, Seung-Jae
    Choi, Yeonju
    Han, Sanghyuck
    2021 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2021, : 9796 - 9801
  • [2] MEDIASUM: A Large-scale Media Interview Dataset for Dialogue Summarization
    Zhu, Chenguang
    Liu, Yang
    Mei, Jie
    Zeng, Michael
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 5927 - 5934
  • [3] LSOIE: A Large-Scale Dataset for Supervised Open Information Extraction
    Solawetz, Jacob
    Larson, Stefan
    16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 2595 - 2600
  • [4] Weakly Supervised Semantic Segmentation for Large-Scale Point Cloud
    Zhang, Yachao
    Li, Zonghao
    Xie, Yuan
    Qu, Yanyun
    Li, Cuihua
    Mei, Tao
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 3421 - 3429
  • [5] A Large-Scale Dataset for Benchmarking Elevator Button Segmentation and Character Recognition
    Liu, Jianbang
    Fang, Yuqi
    Zhu, Delong
    Ma, Nachuan
    Pan, Jin
    Meng, Max Q-H
    2021 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2021), 2021, : 14018 - 14024
  • [6] RailPC: A large-scale railway point cloud semantic segmentation dataset
    Jiang, Tengping
    Li, Shiwei
    Zhang, Qinyu
    Wang, Guangshuai
    Zhang, Zequn
    Zeng, Fankun
    An, Peng
    Jin, Xin
    Liu, Shan
    Wang, Yongjun
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2024, 9 (06) : 1548 - 1560
  • [7] A large-scale remote sensing scene dataset construction for semantic segmentation
    Xu, LeiLei
    Shi, ShanQiu
    Liu, YuJun
    Zhang, Hao
    Wang, Dan
    Zhang, Lu
    Liang, Wan
    Chen, Hao
    INTERNATIONAL JOURNAL OF IMAGE AND DATA FUSION, 2023, 14 (04) : 299 - 323
  • [8] Lizard: A Large-Scale Dataset for Colonic Nuclear Instance Segmentation and Classification
    Graham, Simon
    Jahanifar, Mostafa
    Azam, Ayesha
    Nimir, Mohammed
    Tsang, Yee-Wah
    Dodd, Katherine
    Hero, Emily
    Sahota, Harvir
    Tank, Atisha
    Benes, Ksenija
    Wahab, Noorul
    Minhas, Fayyaz
    Raza, Shan E. Ahmed
    El Daly, Hesham
    Gopalakrishnan, Kishore
    Snead, David
    Rajpoot, Nasir
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 684 - 693
  • [9] Electrical Thermal Image Semantic Segmentation: Large-Scale Dataset and Baseline
    Wang, Futian
    Guo, Yin
    Li, Chenglong
    Lu, Andong
    Ding, Zhongfeng
    Tang, Jin
    Luo, Bin
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2022, 71
  • [10] Non-supervised Macro Segmentation of Large-scale TV Videos
    Bai, Hongliang
    Dong, Chengyu
    Wang, Lezi
    Qin, Gang
    Tao, Kun
    Chang, Xiaofu
    Dong, Yuan
    MULTIMEDIA ON MOBILE DEVICES 2011 AND MULTIMEDIA CONTENT ACCESS: ALGORITHMS AND SYSTEMS V, 2011, 7881