VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG

Cited by: 5
Authors
Ye, Tong [1 ,2 ]
Si, Shijing [1 ]
Wang, Jianzong [1 ]
Wang, Rui [3 ]
Cheng, Ning [1 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Duke Univ, Durham, NC 27706 USA
Keywords
Multi-Modal; Visual Dialog; Patch Embedding; Transformer;
DOI
10.1109/ICASSP43922.2022.9746098
Chinese Library Classification (CLC) number
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
The visual dialog task trains an agent to answer multi-turn questions about an image, which requires a deep understanding of the interactions between the image and the dialog history. Existing research tends to model these interactions with modality-specific modules, which can be cumbersome to use. To fill this gap, we propose VU-BERT, a unified framework for image-text joint embedding, and are the first to apply patch projection to obtain vision embeddings in visual dialog tasks, which simplifies the model. The model is trained on two tasks: masked language modeling and next utterance retrieval. These tasks help it learn visual concepts, dependencies between utterances, and the relationships between the two modalities. Finally, VU-BERT achieves competitive performance (an NDCG score of 0.7287) on the VisDial v1.0 dataset.
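The patch projection mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no patch size, embedding width, or initialization, so the names `extract_patches` and `project` and the constants `PATCH` and `DIM` below are assumptions chosen only for illustration. The idea shown is the general one: cut the image into non-overlapping patches, flatten each patch, and map it to an embedding with a single linear layer, so that image patches can be fed to a Transformer alongside text tokens.

```python
import random


def extract_patches(image, patch):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch in row-major order, channels last.
            flat = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            patches.append(flat)
    return patches


def project(patches, weight, bias):
    """Linear projection: map each flattened patch (length P) to an embedding (length D).

    `weight` is a P x D matrix stored as a list of P rows; `bias` has length D.
    """
    dim = len(bias)
    return [
        [sum(x * w_row[d] for x, w_row in zip(p, weight)) + bias[d] for d in range(dim)]
        for p in patches
    ]


# Toy sizes, assumed for illustration only (not taken from the paper).
H, W, C, PATCH, DIM = 4, 4, 3, 2, 8
rng = random.Random(0)
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(PATCH * PATCH * C)]
bias = [0.0] * DIM

patches = extract_patches(image, PATCH)
embeddings = project(patches, weight, bias)
print(len(embeddings), len(embeddings[0]))  # 4 8
```

In this toy setting a 4 x 4 RGB image yields four patches of 12 values each, and each patch becomes one 8-dimensional vision token. A practical model would typically also add position embeddings and learn `weight` and `bias` during training; those details are not described in the abstract.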
Pages: 6687-6691
Number of pages: 5