Hierarchical Vision and Language Transformer for Efficient Visual Dialog

Cited by: 0
Authors
He, Qiangqiang [1 ]
Zhang, Mujie [1 ]
Zhang, Jie [1 ]
Yang, Shang [1 ]
Wang, Chongjun [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual Dialog; Hierarchical Transformer; Multi-Modal;
DOI
10.1007/978-3-031-44223-0_34
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The visual dialog task requires a deep understanding of an image and a dialog history to answer multiple consecutive questions. Existing research focuses on enhancing cross-modal interaction and fusion but often overlooks the computational complexity and higher-level interaction between the two modalities. This paper proposes a hierarchical vision and language Transformer (HVLT) to address these issues. Specifically, HVLT employs a convolution-like design to learn the interaction and fusion of images and text at different levels. We employ a token merging module to aggregate four spatially adjacent image tokens and four temporally adjacent text tokens into one token and use the expanded [CLS] token to fuse image and text information in a new dimension. This hierarchical architecture allows the model to focus on feature maps of different sizes and dialog history at word, phrase, and sentence levels and reduces the time overhead. We tailor two training objectives for HVLT: masked language regression (MLR) and next sentence prediction (NSP), which help the model understand images and language and learn their relationships. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the competitive performance of HVLT. Finally, we visualize the attention to gain insights into how HVLT works in practice, shedding light on its interpretability.
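The token merging step described in the abstract can be made concrete with a small sketch. This is an illustration only: the class name, the 2x2 spatial grouping of image tokens, the group-of-four grouping of text tokens, and the concatenate-then-project fusion (in the spirit of Swin-style patch merging) are assumptions, not the authors' implementation, and the expanded [CLS] fusion is not reproduced here.

```python
import torch
import torch.nn as nn


class TokenMerging(nn.Module):
    """Sketch of a token merging step: four neighbouring tokens -> one token.

    The record above only states that HVLT aggregates four spatially adjacent
    image tokens and four temporally adjacent text tokens into one token; the
    concatenate-then-project scheme below is an illustrative assumption.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.proj = nn.Linear(4 * dim, dim)  # fuse 4 tokens back to width dim

    def merge_text(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); group every 4 temporally adjacent tokens.
        b, n, d = x.shape
        assert n % 4 == 0, "sequence length must be divisible by 4"
        x = x.reshape(b, n // 4, 4 * d)
        return self.proj(self.norm(x))

    def merge_image(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, H, W, dim); group each 2x2 spatial neighbourhood.
        b, h, w, d = x.shape
        assert h % 2 == 0 and w % 2 == 0, "H and W must be even"
        x = x.reshape(b, h // 2, 2, w // 2, 2, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, h // 2, w // 2, 4 * d)
        return self.proj(self.norm(x))


# Example: halving a 14x14 image grid and quartering a 40-token dialog history.
merge = TokenMerging(dim=768)
img = merge.merge_image(torch.randn(2, 14, 14, 768))  # -> (2, 7, 7, 768)
txt = merge.merge_text(torch.randn(2, 40, 768))       # -> (2, 10, 768)
```

Applied once per stage, such a step halves each spatial dimension of the image feature map and quarters the text sequence length, which is consistent with the hierarchical, convolution-like design and reduced time overhead described in the abstract.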
Pages: 421-432
Number of pages: 12
Related Papers
50 records in total
  • [41] KAT: A Knowledge Augmented Transformer for Vision-and-Language
    Gui, Liangke
    Wang, Borui
    Huang, Qiuyuan
    Hauptmann, Alexander
    Bisk, Yonatan
    Gao, Jianfeng
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 956 - 968
  • [42] MAGVLT: Masked Generative Vision-and-Language Transformer
    Kim, Sungwoong
    Jo, Daejin
    Lee, Donghoon
    Kim, Jongmin
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 23338 - 23348
  • [43] FashionVLP: Vision Language Transformer for Fashion Retrieval with Feedback
    Goenka, Sonam
    Zheng, Zhaoheng
    Jaiswal, Ayush
    Chada, Rakesh
    Wu, Yue
    Hedau, Varsha
    Natarajan, Pradeep
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 14085 - 14095
  • [44] Robust Visual Tracking Using Hierarchical Vision Transformer with Shifted Windows Multi-Head Self-Attention
    Gao, Peng
    Zhang, Xin-Yue
    Yang, Xiao-Li
    Ni, Jian-Cheng
    Wang, Fei
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2024, E107D (01) : 161 - 164
  • [45] UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog
    Chen, Cheng
    Tan, Zhenshan
    Cheng, Qingrong
    Jiang, Xin
    Liu, Qun
    Zhu, Yudong
    Gu, Xiaodong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 18082 - 18091
  • [46] Data Efficient Masked Language Modeling for Vision and Language
    Bitton, Yonatan
    Stanovsky, Gabriel
    Elhadad, Michael
    Schwartz, Roy
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 3013 - 3028
  • [47] Siamese hierarchical feature fusion transformer for efficient tracking
    Dai, Jiahai
    Fu, Yunhao
    Wang, Songxin
    Chang, Yuchun
    FRONTIERS IN NEUROROBOTICS, 2022, 16
  • [48] FlexFormer: Flexible Transformer for efficient visual recognition
    Fan, Xinyi
    Liu, Huajun
    PATTERN RECOGNITION LETTERS, 2023, 169 : 95 - 101
  • [49] VTST: Efficient Visual Tracking With a Stereoscopic Transformer
    Gu, Fengwei
    Lu, Jun
    Cai, Chengtao
    Zhu, Qidan
    Ju, Zhaojie
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8 (03): : 2401 - 2416
  • [50] Learning language to symbol and language to vision mapping for visual grounding
    He, Su
    Yang, Xiaofeng
    Lin, Guosheng
    IMAGE AND VISION COMPUTING, 2022, 122