Optimizing training data for persona-grounded dialogue via Synthetic Label Augmentation

Cited by: 0
Authors
Lee, Chanhee [1, 2]
Kim, Donghyun [1]
Kim, Wongyu [1]
Lee, Kyungchan [1]
Ahn, Youbin [1]
Lee, Kyong-Ho [1]
Shin, Donghoon [3]
Lee, Yeonsoo [4]
Affiliations
[1] Yonsei Univ, Dept Comp Sci, Seoul, South Korea
[2] Samsung Secur, Seoul, South Korea
[3] KT, Seongnam Si, Gyeonggi do, South Korea
[4] NCSOFT, Seongnam Si, Gyeonggi do, South Korea
Keywords
Persona-grounded dialogue; Persona expansion; Data optimization; Synthetic augmentation
DOI
10.1016/j.eswa.2024.125796
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Persona-grounded dialogue systems aim to improve the quality of AI agent responses by strengthening persona consistency and promoting response diversity. Although model tuning has advanced significantly, the training data itself still needs refinement. Expanding the scope of personas has been suggested as a way to bridge this gap. However, the lack of gold labels aligned with these expanded personas makes it hard to train AI agents on the full extent of real-world knowledge. To address this, we propose the Synthetic Label Augmentation framework. The framework (1) creates a background skeleton from the original gold labels by masking persona-related elements, (2) infuses the background skeleton with expanded-persona features to generate synthetic gold labels, (3) identifies the most appropriate synthetic gold labels among the candidates, and (4) merges them into the persona-grounded dialogue dataset. Through extensive experiments on Persona-Chat, we demonstrate that the proposed framework effectively integrates the content of expanded personas to generate synthetic gold labels suited to the dialogue context. Furthermore, response generation experiments using the Optimized Persona-Chat show that our framework significantly enhances AI agents' performance in terms of persona consistency and response diversity.
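The four steps listed in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how masking, infusion, or selection are done, so plain string matching, slot filling, and Jaccard token overlap stand in for them here. All names (`Example`, `make_skeleton`, `infuse`, `select_best`, `merge`) and the sample sentences are hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

MASK = "[MASK]"


@dataclass(frozen=True)
class Example:
    context: str   # dialogue history
    persona: str   # persona sentence the response is grounded in
    response: str  # gold or synthetic response


def make_skeleton(response: str, persona_terms: list[str]) -> str:
    """Step 1 (toy): mask persona-related terms in the gold response."""
    skeleton = response
    for term in persona_terms:
        pattern = rf"\b{re.escape(term)}\b"
        skeleton = re.sub(pattern, MASK, skeleton, flags=re.IGNORECASE)
    return skeleton


def infuse(skeleton: str, expanded_terms: list[str]) -> list[str]:
    """Step 2 (toy): fill the masked slot with each expanded-persona term."""
    if MASK not in skeleton:
        return []
    return [skeleton.replace(MASK, term) for term in expanded_terms]


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))


def select_best(candidates: list[str], context: str) -> str | None:
    """Step 3 (toy): keep the candidate with the highest Jaccard overlap
    with the dialogue context. Assumption: a real system would use a
    learned scorer instead of token overlap."""
    if not candidates:
        return None
    ctx = _tokens(context)

    def jaccard(candidate: str) -> float:
        cand = _tokens(candidate)
        union = cand | ctx
        return len(cand & ctx) / len(union) if union else 0.0

    return max(candidates, key=jaccard)


def merge(dataset: list[Example], synthetic: list[Example]) -> list[Example]:
    """Step 4 (toy): append synthetic examples, skipping exact duplicates."""
    seen = {(ex.context, ex.response) for ex in dataset}
    merged = list(dataset)
    for ex in synthetic:
        key = (ex.context, ex.response)
        if key not in seen:
            merged.append(ex)
            seen.add(key)
    return merged


if __name__ == "__main__":
    context = "Do you enjoy climbing mountains on the weekend?"
    gold = Example(context, "I like hiking.", "I go hiking every weekend.")

    skeleton = make_skeleton(gold.response, ["hiking"])
    candidates = infuse(skeleton, ["camping", "rock climbing"])
    best = select_best(candidates, context)

    synthetic = [Example(context, "I like rock climbing.", best)] if best else []
    for ex in merge([gold], synthetic):
        print(ex.response)
```

The selection step is the part most likely to differ in practice; token overlap is used only because it keeps the sketch dependency-free and deterministic.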
Pages: 11