VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG

Cited by: 5

Authors:

Ye, Tong [1, 2]

Si, Shijing [1]

Wang, Jianzong [1]

Wang, Rui [3]

Cheng, Ning [1]

Xiao, Jing [1]

Affiliations:

[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Peoples R China

[2] Univ Sci & Technol China, Hefei, Peoples R China

[3] Duke Univ, Durham, NC 27706 USA

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

Multi-Modal; Visual Dialog; Patch Embedding; Transformer;

DOI:

10.1109/ICASSP43922.2022.9746098

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

The visual dialog task aims to train an agent to answer multi-turn questions about an image, which requires a deep understanding of the interactions between the image and the dialog history. Existing research tends to employ modality-specific modules to model these interactions, which can be cumbersome to use. To fill this gap, we propose a unified framework for image-text joint embedding, named VU-BERT, and are the first in visual dialog tasks to apply patch projection to obtain vision embeddings, which simplifies the model. The model is trained on two tasks: masked language modeling and next utterance retrieval. These tasks help it learn visual concepts, dependencies between utterances, and the relationships between the two modalities. Finally, VU-BERT achieves competitive performance (an NDCG score of 0.7287) on the VisDial v1.0 dataset.
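The abstract names patch projection as the way vision embeddings are obtained but does not spell out the operation. The following is a minimal sketch of the general technique as it is commonly used in vision transformers, not the authors' implementation: the image is cut into non-overlapping square patches, each patch is flattened, and one shared linear layer maps it to an embedding vector. The function names, the toy image size, the patch size, the embedding width, and the random weights are all assumptions made for illustration.

```python
import random


def extract_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append every channel value
            patches.append(flat)
    return patches


def project_patches(patches, weights, bias):
    """Apply one shared linear layer to every patch.

    `weights` holds one list of length patch_dim per output dimension,
    so each embedding entry is dot(patch, weights[d]) + bias[d].
    """
    embeddings = []
    for patch in patches:
        embeddings.append([
            sum(p * w for p, w in zip(patch, column)) + b
            for column, b in zip(weights, bias)
        ])
    return embeddings


# Toy sizes, chosen only for illustration (hypothetical, not from the paper).
rng = random.Random(0)
HEIGHT, WIDTH, CHANNELS = 4, 4, 3
PATCH_SIZE, EMBED_DIM = 2, 8
PATCH_DIM = PATCH_SIZE * PATCH_SIZE * CHANNELS

image = [[[rng.random() for _ in range(CHANNELS)]
          for _ in range(WIDTH)]
         for _ in range(HEIGHT)]
weights = [[rng.gauss(0.0, 0.02) for _ in range(PATCH_DIM)]
           for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM

patches = extract_patches(image, PATCH_SIZE)
embeddings = project_patches(patches, weights, bias)
print(len(embeddings), len(embeddings[0]))  # 4 patches, 8 values each
```

In a unified model of this kind, the resulting sequence of patch embeddings would typically be concatenated with the text token embeddings and fed to a single transformer, which is what removes the need for a separate object detector or region feature extractor.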

Pages: 6687-6691

Page count: 5
