Optimizing training data for persona-grounded dialogue via Synthetic Label Augmentation

Citations: 0
Authors
Lee, Chanhee [1, 2]
Kim, Donghyun [1]
Kim, Wongyu [1]
Lee, Kyungchan [1]
Ahn, Youbin [1]
Lee, Kyong-Ho [1]
Shin, Donghoon [3]
Lee, Yeonsoo [4]
Affiliations
[1] Yonsei Univ, Dept Comp Sci, Seoul, South Korea
[2] Samsung Secur, Seoul, South Korea
[3] KT, Seongnam Si, Gyeonggi do, South Korea
[4] NCSOFT, Seongnam Si, Gyeonggi do, South Korea
Keywords
Persona-grounded dialogue; Persona expansion; Data optimization; Synthetic augmentation;
DOI
10.1016/j.eswa.2024.125796
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Persona-grounded dialogue systems aim to enhance the quality of AI agent responses by bolstering persona consistency and promoting response diversity. Although model tuning has seen significant advancements, there is an ongoing need to refine the training data itself. Expanding the scope of personas has been suggested as a means to bridge this gap. Nevertheless, the lack of gold labels that align with these expanded personas poses a challenge for training AI agents on the broader real-world knowledge that these personas bring. To tackle this challenge, we propose the Synthetic Label Augmentation framework. This framework (1) creates a background skeleton from the original gold labels by masking persona-related elements, (2) infuses the background skeleton with expanded-persona features to generate synthetic gold labels, (3) identifies the most appropriate synthetic gold labels among the candidates, and (4) merges them into the persona-grounded dialogue dataset. Through extensive experiments on the Persona-Chat dataset, we demonstrate that the proposed framework effectively integrates the content of expanded personas to generate synthetic gold labels suitable for the dialogue context. Furthermore, response generation experiments using the Optimized Persona-Chat show that our framework significantly enhances AI agents' performance in terms of persona consistency and response diversity.
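The four steps listed in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how masking, infusion, or selection are performed, so the sketch swaps in simple string heuristics (phrase masking, template filling, and word overlap with the dialogue context). All function names, the `[PERSONA]` mask token, and the example record layout are hypothetical.

```python
import re

MASK = "[PERSONA]"  # hypothetical mask token, not taken from the paper


def words(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def make_skeleton(gold_label, persona_phrases):
    """Step 1 (toy stand-in): mask persona-related phrases in the gold label."""
    skeleton = gold_label
    for phrase in persona_phrases:
        skeleton = re.sub(re.escape(phrase), MASK, skeleton, flags=re.IGNORECASE)
    return skeleton


def infuse(skeleton, expanded_phrases):
    """Step 2 (toy stand-in): fill the mask with each expanded-persona phrase."""
    if MASK not in skeleton:
        return []
    return [skeleton.replace(MASK, phrase) for phrase in expanded_phrases]


def select_best(candidates, context):
    """Step 3 (toy stand-in): keep the candidate sharing most words with the context."""
    if not candidates:
        return None
    context_words = words(context)
    return max(candidates, key=lambda c: len(context_words & words(c)))


def augment(dataset, example, expanded_phrases):
    """Step 4: merge the selected synthetic label into the dataset."""
    skeleton = make_skeleton(example["response"], example["persona_phrases"])
    best = select_best(infuse(skeleton, expanded_phrases), example["context"])
    if best is not None:
        dataset.append({**example, "response": best, "synthetic": True})
    return dataset


# Invented example record, for illustration only.
example = {
    "context": "Do you have any golden retrievers?",
    "persona_phrases": ["dogs"],
    "response": "Yes, I love dogs and take them along.",
}
augmented = augment([], example, ["hiking trails", "golden retrievers"])
print(augmented[0]["response"])
```

In a realistic setting each heuristic would presumably be replaced by a learned component, for example a generator for step 2 and a scoring model for step 3; the sketch only shows how the four stages hand data to one another.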
Pages: 11