Hierarchical Vision and Language Transformer for Efficient Visual Dialog

Cited by: 0
Authors
He, Qiangqiang [1]
Zhang, Mujie [1]
Zhang, Jie [1]
Yang, Shang [1]
Wang, Chongjun [1]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual Dialog; Hierarchical Transformer; Multi-Modal
DOI
10.1007/978-3-031-44223-0_34
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The visual dialog task requires a deep understanding of an image and a dialog history to answer multiple consecutive questions. Existing research focuses on enhancing cross-modal interaction and fusion but often overlooks the computational complexity and higher-level interaction between the two modalities. This paper proposes a hierarchical vision and language Transformer (HVLT) to address these issues. Specifically, HVLT employs a convolution-like design to learn the interaction and fusion of images and text at different levels. We employ a token merging module to aggregate four spatially adjacent image tokens and four temporally adjacent text tokens into one token and use the expanded [CLS] token to fuse image and text information in a new dimension. This hierarchical architecture allows the model to focus on feature maps of different sizes and dialog history at word, phrase, and sentence levels and reduces the time overhead. We tailor two training objectives for HVLT: masked language regression (MLR) and next sentence prediction (NSP), which help the model understand images and language and learn their relationships. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the competitive performance of HVLT. Finally, we visualize the attention to gain insights into how HVLT works in practice, shedding light on its interpretability.
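The token merging step described in the abstract can be illustrated with a minimal sketch. The abstract states only that four spatially adjacent image tokens and four temporally adjacent text tokens are aggregated into one token; it does not say how. The sketch below assumes aggregation by concatenating the four feature vectors (in the style of patch merging in hierarchical vision Transformers), assumes "spatially adjacent" means a 2x2 block, and uses hypothetical function names. It should not be read as the authors' implementation.

```python
import numpy as np


def merge_image_tokens(tokens, height, width):
    """Merge each 2x2 block of spatially adjacent image tokens into one token.

    tokens: array of shape (height * width, dim), laid out row by row.
    Returns an array of shape ((height // 2) * (width // 2), 4 * dim).
    Aggregation by concatenation is an assumption, not taken from the paper.
    """
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")
    if tokens.shape[0] != height * width:
        raise ValueError("token count must equal height * width")
    dim = tokens.shape[1]
    # Split rows and columns into (block index, position inside block).
    grid = tokens.reshape(height // 2, 2, width // 2, 2, dim)
    # Bring the two block indices to the front: (H/2, W/2, 2, 2, dim).
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Concatenate the four tokens of each block along the feature axis.
    return grid.reshape((height // 2) * (width // 2), 4 * dim)


def merge_text_tokens(tokens):
    """Merge each run of four consecutive text tokens into one token.

    tokens: array of shape (length, dim), with length divisible by 4.
    Returns an array of shape (length // 4, 4 * dim).
    """
    length, dim = tokens.shape
    if length % 4:
        raise ValueError("sequence length must be divisible by 4")
    return tokens.reshape(length // 4, 4 * dim)


# A 4x4 grid of image tokens and 8 text tokens, each with 2 features.
image_tokens = np.arange(16 * 2).reshape(16, 2)
text_tokens = np.arange(8 * 2).reshape(8, 2)

merged_image = merge_image_tokens(image_tokens, height=4, width=4)
merged_text = merge_text_tokens(text_tokens)
print(merged_image.shape, merged_text.shape)
```

Each merge cuts the sequence length by a factor of four, which is consistent with the reduced time overhead the abstract reports. A real model would typically follow the concatenation with a learned linear projection; that step is omitted here.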
Pages: 421-432
Page count: 12