VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG

Cited by: 5
Authors
Ye, Tong [1, 2]
Si, Shijing [1 ]
Wang, Jianzong [1 ]
Wang, Rui [3 ]
Cheng, Ning [1 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Duke Univ, Durham, NC 27706 USA
Keywords
Multi-Modal; Visual Dialog; Patch Embedding; Transformer
DOI
10.1109/ICASSP43922.2022.9746098
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
The visual dialog task trains an agent to answer multi-turn questions about a given image, which requires a deep understanding of the interactions between the image and the dialog history. Existing research tends to model these interactions with modality-specific modules, which can be cumbersome to use. To fill this gap, we propose VU-BERT, a unified framework for image-text joint embedding, and are the first to apply patch projection to obtain vision embeddings in visual dialog tasks, which simplifies the model. The model is trained on two tasks: masked language modeling and next utterance retrieval. These tasks help it learn visual concepts, dependencies between utterances, and the relationships between the two modalities. Finally, VU-BERT achieves competitive performance (an NDCG score of 0.7287) on the VisDial v1.0 dataset.
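The patch projection named in the abstract can be illustrated with a minimal sketch: the image is cut into non-overlapping square patches, each patch is flattened, and one shared linear map turns it into an embedding vector. This is not the authors' implementation. The function names (`split_into_patches`, `project`), the patch size, the hidden dimension, and the weight values are all assumptions chosen for illustration, since the record does not state the paper's actual settings.

```python
def split_into_patches(image, patch):
    """Split an H x W x C image (nested lists) into flattened,
    non-overlapping patch x patch tiles, scanned row by row."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one tile: rows, then pixels, then channels.
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def project(patches, weight, bias):
    """Apply one shared linear map to every flattened patch.
    `weight` holds one row of input-sized coefficients per output dimension."""
    return [
        [sum(x * w for x, w in zip(p, row)) + b for row, b in zip(weight, bias)]
        for p in patches
    ]


# Hypothetical sizes: a 4 x 4 RGB image, 2 x 2 patches, 8-dimensional embeddings.
image = [[[1.0, 1.0, 1.0] for _ in range(4)] for _ in range(4)]
patches = split_into_patches(image, patch=2)   # 4 patches of length 2*2*3 = 12
weight = [[1.0] * 12 for _ in range(8)]        # illustrative weights, not learned
bias = [0.0] * 8
embeddings = project(patches, weight, bias)
print(len(embeddings), len(embeddings[0]))     # prints "4 8"
```

In a model of this kind, the resulting patch embeddings would be concatenated with the text token embeddings and fed to a single transformer, which is what lets one network replace separate modality-specific modules.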
Pages: 6687-6691
Page count: 5