Unified Multimodal Model with Unlikelihood Training for Visual Dialog

Cited by: 6
Authors
Wang, Zihao [1 ]
Wang, Junli [1 ]
Jiang, Changjun [1 ]
Affiliations
[1] Tongji Univ, Natl Prov Minist Joint Collaborat Innovat Ctr Fin, Key Lab Embedded Syst & Serv Comp, Minist Educ, Shanghai, Peoples R China
Keywords
Visual Dialog; Vision and Language; Unlikelihood Training
DOI
10.1145/3503161.3547974
Chinese Library Classification
TP39 [Computer Applications]
Subject Classification Codes
081203; 0835
Abstract
The task of visual dialog requires a multimodal chatbot to answer sequential questions from humans about image content. Prior work performs standard likelihood training for answer generation on positive instances (involving correct answers). However, the likelihood objective often leads to frequent and dull outputs and fails to exploit the useful knowledge in negative instances (involving incorrect answers). In this paper, we propose a Unified Multimodal Model with UnLikelihood Training, named UniMM-UL, to tackle this problem. First, to improve visual dialog understanding and generation through multi-task learning, our model extends ViLBERT from supporting only answer discrimination to handling both answer discrimination and answer generation seamlessly via different attention masks. Specifically, to make the original discriminative model compatible with answer generation, we design novel generative attention masks that implement the autoregressive Masked Language Modeling (autoregressive MLM) task. To attenuate the adverse effects of the likelihood objective, we apply unlikelihood training to negative instances, making the model less likely to generate incorrect answers. Then, to utilize the dense annotations, we adopt different fine-tuning methods for both generating and discriminating answers, rather than only for discriminating answers as in prior work. Finally, on the VisDial dataset, our model achieves the best generative results (69.23 NDCG). It also yields discriminative results comparable to the state of the art in both single-model and ensemble settings (75.92 and 76.17 NDCG, respectively).
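As context for the two training objectives named in the abstract, the following is a minimal PyTorch-style sketch of combining a likelihood loss on a positive (correct) answer with an unlikelihood penalty on a negative (incorrect) answer, in the spirit of unlikelihood training. The function names, the weighting factor alpha, and the tensor shapes are illustrative assumptions, not the paper's actual implementation.

    import torch
    import torch.nn.functional as F

    def mle_loss(logits, token_ids):
        # Standard likelihood (MLE) objective on a positive instance:
        # maximize log p(correct token) at every decoding position.
        # logits: (seq_len, vocab), token_ids: (seq_len,)
        log_probs = F.log_softmax(logits, dim=-1)
        return -log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1).mean()

    def unlikelihood_loss(logits, token_ids, eps=1e-6):
        # Unlikelihood objective on a negative instance:
        # maximize log(1 - p(incorrect token)), pushing probability
        # mass away from the tokens of a wrong answer.
        probs = F.softmax(logits, dim=-1)
        p_neg = probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
        return -torch.log((1.0 - p_neg).clamp(min=eps)).mean()

    # Hypothetical usage: logits_pos / logits_neg are decoder outputs for
    # a correct and an incorrect answer candidate under the same dialog
    # context; alpha is an assumed weighting hyperparameter.
    # loss = mle_loss(logits_pos, pos_ids) + alpha * unlikelihood_loss(logits_neg, neg_ids)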
Pages: 4625-4634
Number of pages: 10