A Large-Scale Corpus for Conversation Disentanglement

Authors
Kummerfeld, Jonathan K. [1]
Athreya, Vignesh [1]
Patel, Siva Sankalp [2]
Gouravajhala, Sai R. [1]
Gunasekara, Chulaka [2]
Polymenakos, Lazaros [2]
Peper, Joseph J. [1]
Ganhotra, Jatin [2]
Lasecki, Walter S. [1]
Affiliations
[1] Univ Michigan, Comp Sci & Engn, Ann Arbor, MI 48109 USA
[2] IBM Res AI, TJ Watson Res Ctr, Yorktown Hts, NY USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Disentangling conversations mixed together in a single stream of messages is a difficult task, made harder by the lack of large manually annotated datasets. We created a new dataset of 77,563 messages manually annotated with reply-structure graphs that both disentangle conversations and define internal conversation structure. Our dataset is 16 times larger than all previously released datasets combined, the first to include adjudication of annotation disagreements, and the first to include context. We use our data to re-examine prior work, finding in particular that 80% of conversations in a widely used dialogue corpus are either missing messages or contain extra messages. Our manually annotated data presents an opportunity to develop robust data-driven methods for conversation disentanglement, which will help advance dialogue research.
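To illustrate how a reply-structure graph can both link messages and disentangle conversations, here is a minimal sketch in Python. It is not the authors' code or data format. It assumes, purely for illustration, that each annotation is a pair (message index, index of the message it replies to), and that a message starting a new conversation links to itself. Under those assumptions, each conversation is a connected component of the graph. The function name and the toy links are hypothetical.

```python
from collections import defaultdict


def conversations_from_reply_links(num_messages, reply_links):
    """Group message indices into conversations.

    Assumption (hypothetical format): reply_links holds
    (message, antecedent) index pairs; a self-link marks a message
    that starts a new conversation. Conversations are the connected
    components of the resulting reply graph.
    """
    # Union-find: parent[i] points towards the representative of i's component.
    parent = list(range(num_messages))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for message, antecedent in reply_links:
        root_a, root_b = find(message), find(antecedent)
        if root_a != root_b:
            # Keep the earliest message as the component representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for message in range(num_messages):
        groups[find(message)].append(message)
    return sorted(groups.values())


# Toy stream of six messages with made-up reply links.
toy_links = [(1, 0), (3, 2), (4, 1), (5, 5)]
print(conversations_from_reply_links(6, toy_links))
# prints [[0, 1, 4], [2, 3], [5]]
```

The reply links themselves carry the internal structure of each conversation (who answered whom), while the components give the disentangled conversations. A corpus that stores only conversation labels would lose the former.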
Pages: 3846-3856
Number of pages: 11