Hierarchical Vision and Language Transformer for Efficient Visual Dialog

Cited by: 0

Authors:

He, Qiangqiang [1]

Zhang, Mujie [1]

Zhang, Jie [1]

Yang, Shang [1]

Wang, Chongjun [1]

Affiliations:

[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China

Source:

ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VI | 2023 / Vol. 14259

Funding:

National Natural Science Foundation of China;

Keywords:

Visual Dialog; Hierarchical Transformer; Multi-Modal;

DOI:

10.1007/978-3-031-44223-0_34

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The visual dialog task requires a deep understanding of an image and a dialog history to answer multiple consecutive questions. Existing research focuses on enhancing cross-modal interaction and fusion but often overlooks the computational complexity and higher-level interaction between the two modalities. This paper proposes a hierarchical vision and language Transformer (HVLT) to address these issues. Specifically, HVLT employs a convolution-like design to learn the interaction and fusion of images and text at different levels. We employ a token merging module to aggregate four spatially adjacent image tokens and four temporally adjacent text tokens into one token and use the expanded [CLS] token to fuse image and text information in a new dimension. This hierarchical architecture allows the model to focus on feature maps of different sizes and dialog history at word, phrase, and sentence levels and reduces the time overhead. We tailor two training objectives for HVLT: masked language regression (MLR) and next sentence prediction (NSP), which help the model understand images and language and learn their relationships. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the competitive performance of HVLT. Finally, we visualize the attention to gain insights into how HVLT works in practice, shedding light on its interpretability.


Pages: 421 - 432

Page count: 12

50 items in total

[21] TVLT: Textless Vision-Language Transformer
Tang, Zineng
Cho, Jaemin
Nie, Yixin
Bansal, Mohit
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
[22] Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization
Huang, Huaibo
Zhou, Xiaoqiang
He, Ran
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
[23] Episodic Transformer for Vision-and-Language Navigation
Pashevich, Alexander
Schmid, Cordelia
Sun, Chen
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 15922 - 15932
[24] Masked Vision-language Transformer in Fashion
Ji, Ge-Peng
Zhuge, Mingchen
Gao, Dehong
Fan, Deng-Ping
Sakaridis, Christos
Gool, Luc Van
MACHINE INTELLIGENCE RESEARCH, 2023, 20 (03) : 421 - 434
[26] Green Hierarchical Vision Transformer for Masked Image Modeling
Huang, Lang
You, Shan
Zheng, Mingkai
Wang, Fei
Qian, Chen
Yamasaki, Toshihiko
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
[27] Convolutional Embedding Makes Hierarchical Vision Transformer Stronger
Wang, Cong
Xu, Hongmin
Zhang, Xiong
Wang, Li
Zheng, Zhitong
Liu, Haifeng
COMPUTER VISION, ECCV 2022, PT XX, 2022, 13680 : 739 - 756
[28] Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention
Pan, Xuran
Ye, Tianzhu
Xia, Zhuofan
Song, Shiji
Huang, Gao
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2082 - 2091
[29] Integrating language, vision and action for human robot dialog systems
Rickert, Markus
Foster, Mary Ellen
Giuliani, Manuel
By, Tomas
Fanin, Giorgio
Knoll, Alois
UNIVERSAL ACCESS IN HUMAN-COMPUTER INTERACTION: AMBIENT INTERACTION, PT 2, PROCEEDINGS, 2007, 4555 : 987 - +
[30] HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification
Ouyang, Shuyi
Wang, Hongyi
Niu, Ziwei
Bai, Zhenjia
Xie, Shiao
Xu, Yingying
Tong, Ruofeng
Chen, Yen-Wei
Lin, Lanfen
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4768 - 4777
