Unified Multimodal Model with Unlikelihood Training for Visual Dialog

Cited by: 6
Authors
Wang, Zihao [1]
Wang, Junli [1]
Jiang, Changjun [1]
Affiliations
[1] Tongji Univ, Natl Prov Minist Joint Collaborat Innovat Ctr Fin, Key Lab Embedded Syst & Serv Comp, Minist Educ, Shanghai, Peoples R China
Keywords
Visual Dialog; Vision and Language; Unlikelihood Training
DOI
10.1145/3503161.3547974
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The task of visual dialog requires a multimodal chatbot to answer sequential questions from humans about image content. Prior work applies standard likelihood training to answer generation on positive instances (those involving correct answers). However, the likelihood objective often leads to frequent, dull outputs and fails to exploit the useful knowledge in negative instances (those involving incorrect answers). In this paper, we propose a Unified Multimodal Model with UnLikelihood Training, named UniMM-UL, to tackle this problem. First, to improve visual dialog understanding and generation through multi-task learning, our model extends ViLBERT from supporting only answer discrimination to supporting both answer discrimination and answer generation seamlessly through different attention masks. Specifically, to make the original discriminative model compatible with answer generation, we design novel generative attention masks that implement the autoregressive Masked Language Modeling (autoregressive MLM) task. To attenuate the adverse effects of the likelihood objective, we apply unlikelihood training to negative instances so that the model becomes less likely to generate incorrect answers. Then, to utilize dense annotations, we adopt different fine-tuning methods for both generating and discriminating answers, rather than only for discriminating answers as in prior work. Finally, on the VisDial dataset, our model achieves the best generative results (69.23 NDCG score). Our model also yields discriminative results comparable with the state of the art in both single-model and ensemble settings (75.92 and 76.17 NDCG scores).
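The contrast between the likelihood and unlikelihood objectives described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the commonly used token-level form, in which a correct answer contributes -log p per token and an incorrect answer contributes -log(1 - p) per token. The function names, the weight `alpha`, and the probability values are hypothetical and chosen only for illustration.

```python
import math


def likelihood_loss(token_probs):
    """Negative log-likelihood over the tokens of a correct answer.

    token_probs: model probabilities assigned to each ground-truth token.
    """
    return -sum(math.log(p) for p in token_probs)


def unlikelihood_loss(token_probs, eps=1e-12):
    """Unlikelihood penalty over the tokens of an incorrect answer.

    Each token contributes -log(1 - p), so the loss grows as the model
    assigns more probability to the incorrect answer's tokens.
    eps guards against log(0) when p is exactly 1.
    """
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs)


def combined_loss(pos_probs, neg_probs, alpha=1.0):
    """Likelihood on a positive instance plus weighted unlikelihood on a
    negative instance. alpha is a hypothetical mixing weight."""
    return likelihood_loss(pos_probs) + alpha * unlikelihood_loss(neg_probs)


# Hypothetical per-token probabilities, for illustration only.
pos = [0.9, 0.8]            # tokens of a correct answer
neg_confident = [0.9, 0.8]  # model is confident in an incorrect answer
neg_unlikely = [0.1, 0.2]   # model already finds the incorrect answer unlikely

print(combined_loss(pos, neg_confident))
print(combined_loss(pos, neg_unlikely))
```

Under these assumptions, the first printed value is larger than the second: the penalty is high when the model assigns high probability to an incorrect answer and shrinks toward zero as that probability falls.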
Pages: 4625-4634
Number of pages: 10