Unified Multimodal Model with Unlikelihood Training for Visual Dialog

Cited by: 6
Authors
Wang, Zihao [1 ]
Wang, Junli [1 ]
Jiang, Changjun [1 ]
Affiliations
[1] Tongji Univ, Natl Prov Minist Joint Collaborat Innovat Ctr Fin, Key Lab Embedded Syst & Serv Comp, Minist Educ, Shanghai, Peoples R China
Keywords
Visual Dialog; Vision and Language; Unlikelihood Training
DOI
10.1145/3503161.3547974
Chinese Library Classification
TP39 [Computer Applications]
Discipline Codes
081203; 0835
Abstract
The task of visual dialog requires a multimodal chatbot to answer sequential questions from humans about image content. Prior work performs standard likelihood training for answer generation on positive instances (involving correct answers). However, the likelihood objective often produces frequent, dull outputs and fails to exploit the useful knowledge in negative instances (involving incorrect answers). In this paper, we propose a Unified Multimodal Model with UnLikelihood Training, named UniMM-UL, to tackle this problem. First, to improve visual dialog understanding and generation through multi-task learning, our model extends ViLBERT from supporting only answer discrimination to supporting both answer discrimination and answer generation seamlessly via different attention masks. Specifically, to make the original discriminative model compatible with answer generation, we design novel generative attention masks to implement the autoregressive Masked Language Modeling (autoregressive MLM) task. To attenuate the adverse effects of the likelihood objective, we exploit unlikelihood training on negative instances to make the model less likely to generate incorrect answers. Then, to utilize dense annotations, we adopt different fine-tuning methods for both generating and discriminating answers, rather than only for discriminating answers as in prior work. Finally, on the VisDial dataset, our model achieves the best generative results (69.23 NDCG score) and yields discriminative results comparable to the state of the art in both single-model and ensemble settings (75.92 and 76.17 NDCG scores).
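The generative attention mask described in the abstract is, in essence, a prefix-LM mask: context tokens attend bidirectionally among themselves, while answer tokens attend to the full context and only to earlier answer tokens. Below is a minimal PyTorch sketch of such a mask pair, assuming boolean masks where True marks an allowed attention edge; the function name and layout are illustrative, not taken from the paper.

```python
import torch

def build_attention_masks(ctx_len: int, ans_len: int):
    """Return (discriminative, generative) boolean attention masks.

    True means "query position may attend to key position".
    """
    total = ctx_len + ans_len

    # Discriminative mode: full bidirectional attention over context + answer.
    disc = torch.ones(total, total, dtype=torch.bool)

    # Generative mode (autoregressive MLM): the context is visible to every
    # position, but answer tokens are predicted left-to-right.
    gen = torch.zeros(total, total, dtype=torch.bool)
    gen[:, :ctx_len] = True  # all positions may attend to the context
    # Within the answer span, apply a causal (lower-triangular) mask.
    gen[ctx_len:, ctx_len:] = torch.tril(
        torch.ones(ans_len, ans_len, dtype=torch.bool)
    )
    return disc, gen
```

Because both modes share one backbone, switching between discrimination and generation amounts to swapping the mask, which is what lets a single ViLBERT-style encoder serve both tasks.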
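The unlikelihood term itself can be read as the token-level formulation of Welleck et al. (2020): instead of maximizing log p for correct tokens, it minimizes -log(1 - p) for tokens of an incorrect answer, pushing probability mass away from them. The sketch below is written under that assumption; the paper's exact weighting and negative-sampling details may differ, and `pad_id` is an illustrative parameter.

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits: torch.Tensor, neg_tokens: torch.Tensor,
                      pad_id: int = 0) -> torch.Tensor:
    """Penalize probability assigned to tokens of a negative (incorrect) answer.

    logits:     (batch, seq, vocab) scores at the answer positions
    neg_tokens: (batch, seq) token ids of an incorrect answer
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability of each negative token under the model.
    neg_logp = log_probs.gather(-1, neg_tokens.unsqueeze(-1)).squeeze(-1)
    p_neg = neg_logp.exp().clamp(max=1.0 - 1e-6)  # avoid log(0)
    loss = -torch.log1p(-p_neg)                   # -log(1 - p(neg token))
    mask = neg_tokens.ne(pad_id).float()          # ignore padding positions
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```

In training, a term like this would typically be added to the standard likelihood loss on positive answers, so the model learns both what to say and what not to say from the same dialog.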
Pages: 4625-4634
Page count: 10